1 Introduction

Time series occur in most fields of study that produce quantitative data. Whenever quantities are measured over time, those measurements form a time series or, more formally, a discrete-time stochastic process.

1.1 Workshop Materials

All materials for this workshop are available in my standard GitHub repo:

https://github.com/kaybenleroll/data_workshops

The content of this workshop is partly based on the book “Introductory Time Series with R” by Paul Cowpertwait and Andrew Metcalfe. The data from this book is available from the website:

http://www.maths.adelaide.edu.au/emac2009/

For further resources, anyone with an interest in this topic should take a look at the book “Forecasting: Principles and Practice” by Rob J Hyndman and George Athanasopoulos.

While available in dead-tree form, the full text is also available online:

https://otexts.com/fpp2/

One final recommendation is the book “The Analysis of Time Series” by Christopher Chatfield. This is a principles- and theory-based book on the underlying statistical concepts of time-series analysis; it does not contain much in the way of code, but is still highly recommended.

1.1.1 R Packages

In this workshop we focus on the use of ‘tidy’-style tools in the analysis of time-series. In particular we look at packages such as tidyquant that enable and simplify this approach to time-series analysis.
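Before running any of the code below, a setup block along the following lines is assumed. The exact package list is my inference from the functions used in later sections (tk_tbl(), tq_mutate(), yearmon, comma(), cran_downloads()), so treat it as a sketch rather than the workshop's definitive setup:

```r
# Packages assumed by the code in this workshop; this list is inferred
# from the functions used in later sections.
library(tidyverse)   # dplyr, tidyr, ggplot2, readr, tibble, etc.
library(scales)      # comma() for axis labels
library(zoo)         # yearmon index class and rollapply()
library(timetk)      # tk_tbl() for ts -> tibble conversion
library(tidyquant)   # tq_transmute() and tq_mutate()
library(cranlogs)    # cran_downloads()
```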

1.2 Basic Concepts

A famous example of a time-series is the count of international airline passengers, as shown in the figure below. This is a simple univariate time-series, with measurements taken on a monthly basis over a number of years, each datum consisting of a single number - the number of passengers travelling on commercial flights that month.

plot(AirPassengers)

Before we begin analysing such data, we first need to create a mathematical framework to work in. Fortunately, we do not need anything too complicated, and for a finite time-series of length \(N\), we model the time series as a sequence of \(N\) random variables, \(X_i\), with \(i = 1, 2, ..., N\).

Note that each individual \(X_i\) is a wholly separate random variable: we only ever have a single measurement from each random variable. In many cases we simplify this, but it is important to understand and appreciate that such simplifications are just that. Time series are difficult to analyse.

Before we get to any of that though, and before we try to build any kind of model for the data, we always start by visualising the data. Often, a simple plot of the data helps us pick out aspects to analyse and incorporate into the models. For time series, one of the first plots to produce is the time plot, a simple plot of the data over time.

For the passenger data, a few aspects stand out that are very common in time series. It is apparent that the numbers increase over time, and this systematic change in the data is called the trend. Approximating the trend as a linear function of time is often adequate.

A repeating pattern in the data that occurs over the period of the data (in this case, each year), is called the seasonal variation, though a more general concept of ‘season’ is implied — it often will not coincide with the seasons of the calendar.

A slightly more generalised concept than seasonality is that of cycles: repeating patterns in the data that do not correspond to the natural fixed periods of the model. None of these are apparent in the air passenger data, and accounting for them is beyond the scope of this introductory tutorial.

Finally, another important benefit of visualising the data is that it helps identify possible outliers and erroneous data.
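The linear-trend approximation mentioned above can be sketched in a few lines of base R. This sketch is mine, not part of the workshop code:

```r
# Fit a straight-line trend to the AirPassengers series.
# time(AirPassengers) gives the decimal year of each observation.
air_df <- data.frame(
    t     = as.numeric(time(AirPassengers)),
    count = as.numeric(AirPassengers)
)

trend_fit <- lm(count ~ t, data = air_df)

coef(trend_fit)   # intercept and slope (passengers gained per year)
```

The fitted slope is positive, confirming the upward trend visible in the time plot.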

In many cases, we will also be dealing with time series that have multiple values at all, many, or some of the points in time.

Often, these values will be related in some way, and we will want to analyse those relationships too. In fact, one of the most effective methods of prediction is to find leading indicators for the value or values you wish to predict: you can often use the current values of the leading indicators to make inferences about future values of the related quantities.

The fact that this is one of the best methods in time series analysis says a lot about the difficulty of prediction (Yogi Berra, a US baseball player noted for his pithy statements, once said “Prediction is difficult, especially about the future”).
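One common way to hunt for a leading indicator is the cross-correlation function. This simulated sketch (not part of the workshop code, which does not otherwise cover ccf()) constructs a series y that copies x three steps later, and checks that the cross-correlation peaks at the corresponding lag:

```r
# Simulated example of detecting a leading indicator with ccf().
set.seed(42)
n <- 200
x <- as.numeric(arima.sim(list(ar = 0.8), n))

# y at time t equals x three steps earlier, plus noise, so x "leads" y.
x_t <- x[4:n]
y_t <- x[1:(n - 3)] + rnorm(n - 3, sd = 0.5)

# ccf(x, y) at lag k estimates cor(x[t+k], y[t]); a peak at a
# negative lag indicates that x leads y.
cc <- ccf(x_t, y_t, lag.max = 10, plot = FALSE)

cc$lag[which.max(cc$acf)]   # peaks at -3: x leads y by three steps
```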

1.3 Example Timeseries

In this workshop we will look at a number of different time-series, discussed here.

This data comes in a few different formats, and this workshop discusses methods for getting it into a common format for analysis.

1.3.1 Air Passenger Data

As mentioned previously, a canonical time-series is the airline passenger dataset, and this is the first dataset we look at.

AirPassengers %>% print()
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
AirPassengers %>% plot()

In this workshop we will convert all time series into tibbles: the package timetk allows us to do this.

airpassengers_tbl <- AirPassengers %>% tk_tbl(rename_index = 'month')

airpassengers_tbl %>% print()
## # A tibble: 144 x 2
##    month         value
##    <S3: yearmon> <dbl>
##  1 Jan 1949        112
##  2 Feb 1949        118
##  3 Mar 1949        132
##  4 Apr 1949        129
##  5 May 1949        121
##  6 Jun 1949        135
##  7 Jul 1949        148
##  8 Aug 1949        148
##  9 Sep 1949        136
## 10 Oct 1949        119
## # ... with 134 more rows
ggplot(airpassengers_tbl) +
    geom_line(aes(x = month, y = value)) +
    xlab('Date') +
    ylab('Passenger Count') +
    expand_limits(y = 0) +
    ggtitle('Plot of Air Passenger Time Series')

1.3.2 Maine Unemployment Data

Time series are very common in econometrics, and one dataset provided in the text is the monthly unemployment rate in Maine from 1996 onwards. I have included the data file in this workshop.

maine_ts <- scan('data/Maine.dat', skip = 1) %>%
    ts(start = 1996, frequency = 12)

maine_ts %>% print()
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1996 6.7 6.7 6.4 5.9 5.2 4.8 4.8 4.0 4.2 4.4 5.0 5.0
## 1997 6.4 6.5 6.3 5.9 4.9 4.8 4.5 4.0 4.1 4.3 4.8 5.0
## 1998 6.2 5.7 5.6 4.6 4.0 4.2 4.1 3.6 3.7 4.1 4.3 4.0
## 1999 4.9 5.0 4.6 4.3 3.9 4.0 3.6 3.3 3.1 3.3 3.7 3.7
## 2000 4.4 4.4 4.1 3.5 3.1 3.0 2.8 2.5 2.6 2.8 3.1 3.0
## 2001 3.9 4.2 4.0 4.1 3.5 3.5 3.4 3.1 3.4 3.7 4.0 4.0
## 2002 5.0 4.9 5.0 4.7 4.0 4.2 4.0 3.6 3.7 3.9 4.5 4.6
## 2003 5.6 5.8 5.6 5.5 4.8 4.9 4.8 4.3 4.5 4.6 4.8 4.7
## 2004 5.6 5.6 5.5 4.8 4.2 4.3 4.2 3.8 4.0 4.2 4.6 4.6
## 2005 5.5 5.8 5.5 5.2 4.7 4.6 4.5 4.1 4.4 4.4 4.8 4.6
## 2006 5.3 5.6 4.9 4.6 4.2 4.4 4.4 3.9
maine_ts %>% plot()

As before, we convert this data into a tibble and recreate the plot using ggplot2.

maine_tbl <- maine_ts %>% tk_tbl(rename_index = 'month')

maine_tbl %>% print()
## # A tibble: 128 x 2
##    month         value
##    <S3: yearmon> <dbl>
##  1 Jan 1996        6.7
##  2 Feb 1996        6.7
##  3 Mar 1996        6.4
##  4 Apr 1996        5.9
##  5 May 1996        5.2
##  6 Jun 1996        4.8
##  7 Jul 1996        4.8
##  8 Aug 1996        4  
##  9 Sep 1996        4.2
## 10 Oct 1996        4.4
## # ... with 118 more rows
ggplot(maine_tbl) +
    geom_line(aes(x = month, y = value)) +
    xlab('Date') +
    ylab('Unemployment Rate (%)') +
    expand_limits(y = 0) +
    ggtitle('Plot of Maine Unemployment Time Series')

1.3.3 Australian Consumption Statistics (CBE)

Governments produce regular data on consumption numbers for their economy.

One such dataset is contained in the file cbe.dat, produced by the Australian Bureau of Statistics, containing monthly chocolate, beer, and electricity production figures.

cbe_raw_tbl <- read_tsv('data/cbe.dat')

cbe_raw_tbl %>% glimpse()
## Observations: 396
## Variables: 3
## $ choc <dbl> 1451, 2037, 2477, 2785, 2994, 2681, 3098, 2708, 2517, 2445, 20...
## $ beer <dbl> 96.3, 84.4, 91.2, 81.9, 80.5, 70.4, 74.8, 75.9, 86.3, 98.7, 10...
## $ elec <dbl> 1497, 1463, 1648, 1595, 1777, 1824, 1994, 1835, 1787, 1699, 16...

Similar to the Maine file, this data does not contain time indices. For the sake of completeness, we use the same approach as before and convert to a tibble, but we will also show how to construct the time index directly, without intermediate conversions.

First we add time indices via ts conversions.

cbe_ts <- cbe_raw_tbl %>%
    as.matrix() %>%
    ts(start = 1958, frequency = 12)

cbe_ts %>% head(10)
##          choc beer elec
## Jan 1958 1451 96.3 1497
## Feb 1958 2037 84.4 1463
## Mar 1958 2477 91.2 1648
## Apr 1958 2785 81.9 1595
## May 1958 2994 80.5 1777
## Jun 1958 2681 70.4 1824
## Jul 1958 3098 74.8 1994
## Aug 1958 2708 75.9 1835
## Sep 1958 2517 86.3 1787
## Oct 1958 2445 98.7 1699
cbe_ts %>% plot()

An alternative approach is to add the time index directly.

n_data <- cbe_raw_tbl %>% nrow()

cbe_tbl <- cbe_raw_tbl %>%
    add_column(month = (1958 + (0:(n_data-1)/12)) %>% yearmon, .before = 1)


cbe_tbl %>% glimpse()
## Observations: 396
## Variables: 4
## $ month <S3: yearmon> Jan 1958, Feb 1958, Mar 1958, Apr 1958, May 1958, Jun...
## $ choc  <dbl> 1451, 2037, 2477, 2785, 2994, 2681, 3098, 2708, 2517, 2445, 2...
## $ beer  <dbl> 96.3, 84.4, 91.2, 81.9, 80.5, 70.4, 74.8, 75.9, 86.3, 98.7, 1...
## $ elec  <dbl> 1497, 1463, 1648, 1595, 1777, 1824, 1994, 1835, 1787, 1699, 1...
cbe_tbl %>% print()
## # A tibble: 396 x 4
##    month          choc  beer  elec
##    <S3: yearmon> <dbl> <dbl> <dbl>
##  1 Jan 1958       1451  96.3  1497
##  2 Feb 1958       2037  84.4  1463
##  3 Mar 1958       2477  91.2  1648
##  4 Apr 1958       2785  81.9  1595
##  5 May 1958       2994  80.5  1777
##  6 Jun 1958       2681  70.4  1824
##  7 Jul 1958       3098  74.8  1994
##  8 Aug 1958       2708  75.9  1835
##  9 Sep 1958       2517  86.3  1787
## 10 Oct 1958       2445  98.7  1699
## # ... with 386 more rows

Having constructed the tibble, we now construct these time series plots using ggplot2.

plot_tbl <- cbe_tbl %>%
    gather('product', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = product)) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Production Amount") +
    scale_y_continuous(labels = comma) +
    ggtitle('Production Data from Australian Government')

Due to the different scales, it might be more useful to use a faceted plot for each product:

plot_tbl <- cbe_tbl %>%
    gather('product', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value)) +
    facet_grid(rows = vars(product), scales = 'free_y') +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Production Amount") +
    scale_y_continuous(labels = comma) +
    ggtitle('Production Data from Australian Government')

1.3.4 Australian Building Activity

Another dataset produced by the Australian Bureau of Statistics is the Building Activity Publication, listing the value of building work done in each quarter. This data contains the total dwellings approved for construction per month, averaged over the past three months, and the value of work done over the past three months.

This data is quarterly from 1996, and we construct the data as before.

approvactiv_raw_tbl <- read_tsv('data/ApprovActiv.dat')

approvactiv_raw_tbl %>% glimpse()
## Observations: 43
## Variables: 2
## $ Approvals <dbl> 9988.0, 10320.3, 10682.3, 11086.7, 11604.0, 11861.0, 1241...
## $ Activity  <dbl> 5747.0, 6388.8, 6715.6, 7048.2, 6600.4, 6919.6, 7255.6, 7...
approvactiv_ts <- approvactiv_raw_tbl %>%
    as.matrix() %>%
    ts(start = 1996, frequency = 4)

approvactiv_ts %>% head(10)
##         Approvals Activity
## 1996 Q1    9988.0   5747.0
## 1996 Q2   10320.3   6388.8
## 1996 Q3   10682.3   6715.6
## 1996 Q4   11086.7   7048.2
## 1997 Q1   11604.0   6600.4
## 1997 Q2   11861.0   6919.6
## 1997 Q3   12418.7   7255.6
## 1997 Q4   12887.0   7211.5
## 1998 Q1   13212.3   7724.6
## 1998 Q2   13486.0   7952.1
approvactiv_ts %>% plot()

approvactiv_tbl <- approvactiv_ts %>%
    tk_tbl(rename_index = 'quarter')

approvactiv_tbl %>% glimpse()
## Observations: 43
## Variables: 3
## $ quarter   <S3: yearqtr> 1996 Q1, 1996 Q2, 1996 Q3, 1996 Q4, 1997 Q1, 1997...
## $ Approvals <dbl> 9988.0, 10320.3, 10682.3, 11086.7, 11604.0, 11861.0, 1241...
## $ Activity  <dbl> 5747.0, 6388.8, 6715.6, 7048.2, 6600.4, 6919.6, 7255.6, 7...
approvactiv_tbl %>% print()
## # A tibble: 43 x 3
##    quarter       Approvals Activity
##    <S3: yearqtr>     <dbl>    <dbl>
##  1 1996 Q1           9988     5747 
##  2 1996 Q2          10320.    6389.
##  3 1996 Q3          10682.    6716.
##  4 1996 Q4          11087.    7048.
##  5 1997 Q1          11604     6600.
##  6 1997 Q2          11861     6920.
##  7 1997 Q3          12419.    7256.
##  8 1997 Q4          12887     7212.
##  9 1998 Q1          13212.    7725.
## 10 1998 Q2          13486     7952.
## # ... with 33 more rows

As before, we can produce time plots in ggplot2.

plot_tbl <- approvactiv_tbl %>%
    gather('label', 'value', -quarter)

ggplot(plot_tbl) +
    geom_line(aes(x = quarter, y = value)) +
    facet_grid(rows = vars(label), scales = 'free_y') +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Amount") +
    scale_y_continuous(labels = comma) +
    ggtitle('Building Activity Data from Australian Government')

1.3.5 CRAN Package Downloads Data

An interesting dataset is the daily count of package downloads from CRAN. This data is easy to obtain with the cranlogs package, which provides the cran_downloads() function.

For this workshop, we will look at some of the main packages that comprise the ‘tidyverse’, as well as the total number of downloads from CRAN.
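The code below calls a helper named retrieve_cran_download_data(), whose definition is not shown here. As a hypothetical sketch of what such a helper might look like (the real definition lives elsewhere in the workshop repository), it could be a thin wrapper around the genuine cranlogs function cran_downloads(), which returns a data frame with columns date, count, and package:

```r
# Hypothetical sketch of the retrieve_cran_download_data() helper;
# the actual definition is not part of this document.
library(dplyr)

retrieve_cran_download_data <- function(pkgs, from_date, to_date) {
    cranlogs::cran_downloads(packages = pkgs, from = from_date, to = to_date) %>%
        as_tibble() %>%
        select(date, count, package)
}
```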

cran_data_file <- 'data/cran_download_data.csv'

if(file.exists(cran_data_file)) {
    cran_data_tbl <- read_csv(cran_data_file)
} else {
    cran_pkgs <- c('dplyr', 'tidyr', 'ggplot2', 'lubridate', 'stringr', 'tibble'
                  ,'broom', 'jsonlite', 'purrr', 'readr', 'tidyquant')
      

    cran_data_tbl <- retrieve_cran_download_data(cran_pkgs, '2014-01-01', '2018-12-31')
    
    write_csv(cran_data_tbl, path = cran_data_file)
}


cran_data_tbl %>% glimpse()
## Observations: 21,912
## Variables: 3
## $ date    <date> 2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014-01-05...
## $ count   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 77, 138, 12...
## $ package <chr> "dplyr", "dplyr", "dplyr", "dplyr", "dplyr", "dplyr", "dply...
cran_data_tbl %>% print()
## # A tibble: 21,912 x 3
##    date       count package
##    <date>     <dbl> <chr>  
##  1 2014-01-01     0 dplyr  
##  2 2014-01-02     0 dplyr  
##  3 2014-01-03     0 dplyr  
##  4 2014-01-04     0 dplyr  
##  5 2014-01-05     0 dplyr  
##  6 2014-01-06     0 dplyr  
##  7 2014-01-07     0 dplyr  
##  8 2014-01-08     0 dplyr  
##  9 2014-01-09     0 dplyr  
## 10 2014-01-10     0 dplyr  
## # ... with 21,902 more rows

First we construct a simple line plot of the download counts, facetted by package.

ggplot(cran_data_tbl) +
    geom_line(aes(x = date, y = count)) +
    facet_wrap(vars(package), scales = 'free_y') +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    ggtitle('Facetted Lineplots of CRAN Daily Downloads')

Not all packages have download data as some packages were created after the start of our observation period. This manifests as zero counts for that package. We discuss these issues later on in the workshop.

1.4 Combining Time Series

Useful insights are often gained from combining different datasets together.

Looking at our datasets, one possibly interesting relationship is that between electricity production and airline passengers: it is reasonable to expect both of these quantities to track the overall health and size of the economy.

A major benefit of using tidy tools for time series is that such data manipulation and arrangement become simple: combining datasets is just a matter of using two-table joins.

To illustrate, we combine the Air Passenger data with the Australian economic data, using the following code. Note that we rename the air passenger data at the end to make it more meaningful and useful.

ap_econ_combined_tbl <- airpassengers_tbl %>%
    left_join(cbe_tbl, by = 'month') %>%
    filter(complete.cases(.)) %>%
    rename(air = value)


ap_econ_combined_tbl %>% glimpse()
## Observations: 36
## Variables: 5
## $ month <S3: yearmon> Jan 1958, Feb 1958, Mar 1958, Apr 1958, May 1958, Jun...
## $ air   <dbl> 340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337, 3...
## $ choc  <dbl> 1451, 2037, 2477, 2785, 2994, 2681, 3098, 2708, 2517, 2445, 2...
## $ beer  <dbl> 96.3, 84.4, 91.2, 81.9, 80.5, 70.4, 74.8, 75.9, 86.3, 98.7, 1...
## $ elec  <dbl> 1497, 1463, 1648, 1595, 1777, 1824, 1994, 1835, 1787, 1699, 1...
ap_econ_combined_tbl %>% print()
## # A tibble: 36 x 5
##    month           air  choc  beer  elec
##    <S3: yearmon> <dbl> <dbl> <dbl> <dbl>
##  1 Jan 1958        340  1451  96.3  1497
##  2 Feb 1958        318  2037  84.4  1463
##  3 Mar 1958        362  2477  91.2  1648
##  4 Apr 1958        348  2785  81.9  1595
##  5 May 1958        363  2994  80.5  1777
##  6 Jun 1958        435  2681  70.4  1824
##  7 Jul 1958        491  3098  74.8  1994
##  8 Aug 1958        505  2708  75.9  1835
##  9 Sep 1958        404  2517  86.3  1787
## 10 Oct 1958        359  2445  98.7  1699
## # ... with 26 more rows

We return to this dataset later in the workshop.

2 Manipulation of Time Series Data

As with all data, it is rare to get time-series exactly in the format we want for analysis. For various reasons, we may want to analyse transformations or aggregations of this data.

Much like feature engineering and variable selection, this process can be more art than science - there are no hard and fast rules, merely principles and rules-of-thumb.

The last few years in particular have seen the development of a number of tools to aid us with the analysis of time series along ‘tidy’ principles. In particular, we will focus on the use of tidyquant - a package aimed at analysing financial data, but which is also very useful for time series.

2.1 Aggregating Data

From a conceptual point of view, aggregation is the most straightforward of these transformations - we group the data by longer periods of time and summarise each individual ‘bucket’ of data as desired or required.

2.1.1 Single Statistics

As an example of this, suppose we wish to look at the air passenger data as an annual sum for each year. Our data is monthly, so we need to aggregate this data into annual numbers.

ap_yearly_dplyr_tbl <- airpassengers_tbl %>%
    group_by(year = month %>% format('%Y') %>% as.numeric()) %>%
    summarise(ann_total = sum(value))


ap_yearly_dplyr_tbl %>% glimpse()
## Observations: 12
## Variables: 2
## $ year      <dbl> 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 195...
## $ ann_total <dbl> 1520, 1676, 2042, 2364, 2700, 2867, 3408, 3939, 4421, 457...
ap_yearly_dplyr_tbl %>% print()
## # A tibble: 12 x 2
##     year ann_total
##    <dbl>     <dbl>
##  1  1949      1520
##  2  1950      1676
##  3  1951      2042
##  4  1952      2364
##  5  1953      2700
##  6  1954      2867
##  7  1955      3408
##  8  1956      3939
##  9  1957      4421
## 10  1958      4572
## 11  1959      5140
## 12  1960      5714

The above transformation was straightforward using existing dplyr functionality, but we can also use the routines provided by tidyquant, in particular its function tq_transmute():

ap_yearly_tidyquant_tbl <- airpassengers_tbl %>%
    tq_transmute(
        select     = value
       ,mutate_fun = apply.yearly
       ,FUN        = sum
       ,na.rm      = TRUE
       ,col_rename = 'ann_total'
    ) %>%
    mutate(year = month %>% format('%Y') %>% as.numeric())


ap_yearly_tidyquant_tbl %>% glimpse()
## Observations: 12
## Variables: 3
## $ month     <S3: yearmon> Dec 1949, Dec 1950, Dec 1951, Dec 1952, Dec 1953,...
## $ ann_total <dbl> 1520, 1676, 2042, 2364, 2700, 2867, 3408, 3939, 4421, 457...
## $ year      <dbl> 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 195...
ap_yearly_tidyquant_tbl %>% print()
## # A tibble: 12 x 3
##    month         ann_total  year
##    <S3: yearmon>     <dbl> <dbl>
##  1 Dec 1949           1520  1949
##  2 Dec 1950           1676  1950
##  3 Dec 1951           2042  1951
##  4 Dec 1952           2364  1952
##  5 Dec 1953           2700  1953
##  6 Dec 1954           2867  1954
##  7 Dec 1955           3408  1955
##  8 Dec 1956           3939  1956
##  9 Dec 1957           4421  1957
## 10 Dec 1958           4572  1958
## 11 Dec 1959           5140  1959
## 12 Dec 1960           5714  1960

We can now look at a lineplot for this new time-series of annual totals. This gives us an idea of the overall trend in the data.

ggplot(ap_yearly_tidyquant_tbl) +
    geom_line(aes(x = year, y = ann_total)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    ggtitle('Lineplot of the Annual Air Passenger Totals')

2.1.1.1 Exercises

  1. Aggregate the Maine unemployment data into yearly totals.
  2. Create the relevant lineplots to check this data for trends and patterns.
  3. Aggregate the CBE data into yearly totals.
  4. Create plots of this data to check for trends.

2.1.2 Multiple Statistics

Should we need multiple statistics in our output, we implement this by writing a custom function that outputs the statistics we require.

custom_stats_func <- function(x, na.rm = TRUE, ...) {
    c(
          sum    = sum(x, na.rm = na.rm)
         ,mean   = mean(x, na.rm = na.rm)
         ,sd     = sd(x, na.rm = na.rm)
         ,median = median(x, na.rm = na.rm)
         ,q25    = quantile(x, na.rm = na.rm, probs = 0.25, names = FALSE)
         ,q75    = quantile(x, na.rm = na.rm, probs = 0.75, names = FALSE)
    )
}


ap_yearly_tidyquant_custom_tbl <- airpassengers_tbl %>%
    tq_transmute(
        select     = value
       ,mutate_fun = apply.yearly
       ,FUN        = custom_stats_func
       ,na.rm      = TRUE
    ) %>%
    mutate(year = month %>% format('%Y') %>% as.numeric())


ap_yearly_tidyquant_custom_tbl %>% glimpse()
## Observations: 12
## Variables: 8
## $ month  <S3: yearmon> Dec 1949, Dec 1950, Dec 1951, Dec 1952, Dec 1953, De...
## $ sum    <dbl> 1520, 1676, 2042, 2364, 2700, 2867, 3408, 3939, 4421, 4572, ...
## $ mean   <dbl> 126.667, 139.667, 170.167, 197.000, 225.000, 238.917, 284.00...
## $ sd     <dbl> 13.7201, 19.0708, 18.4383, 22.9664, 28.4669, 34.9245, 42.140...
## $ median <dbl> 125.0, 137.5, 169.0, 192.0, 232.0, 231.5, 272.0, 315.0, 351....
## $ q25    <dbl> 118.00, 125.75, 159.00, 180.75, 199.75, 221.25, 260.75, 300....
## $ q75    <dbl> 135.25, 151.25, 179.50, 211.25, 238.50, 260.25, 312.75, 359....
## $ year   <dbl> 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, ...
ap_yearly_tidyquant_custom_tbl %>% print()
## # A tibble: 12 x 8
##    month           sum  mean    sd median   q25   q75  year
##    <S3: yearmon> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Dec 1949       1520  127.  13.7   125   118   135.  1949
##  2 Dec 1950       1676  140.  19.1   138.  126.  151.  1950
##  3 Dec 1951       2042  170.  18.4   169   159   180.  1951
##  4 Dec 1952       2364  197   23.0   192   181.  211.  1952
##  5 Dec 1953       2700  225   28.5   232   200.  238.  1953
##  6 Dec 1954       2867  239.  34.9   232.  221.  260.  1954
##  7 Dec 1955       3408  284   42.1   272   261.  313.  1955
##  8 Dec 1956       3939  328.  47.9   315   300.  360.  1956
##  9 Dec 1957       4421  368.  57.9   352.  331.  408.  1957
## 10 Dec 1958       4572  381   64.5   360.  339.  412.  1958
## 11 Dec 1959       5140  428.  69.8   406.  388.  465.  1959
## 12 Dec 1960       5714  476.  77.7   461   418.  515.  1960

2.1.2.1 Exercises

  1. Repeat the aggregation shown using the Maine unemployment data
  2. Repeat the aggregation shown using the CBE data.
  3. Create a custom function to calculate mean, sd, skew and kurtosis

2.2 Rolling Functions

Another common transformation of time-series is to apply a function over a fixed rolling window of data.

Note that rolling functions differ conceptually from aggregates: they are not calculated over disjoint subsets of the data, and the output is indexed at the same time points as the original data.

Because of this difference we use a different function from tidyquant to execute this calculation: tq_mutate().

2.2.1 Moving Averages

A common rolling function is the moving average: we calculate the average value of the time series over a fixed window of data.

ap_rollmean_sixmonth_tbl <- airpassengers_tbl %>%
    tq_mutate(
        # tq_mutate args
        select     = value
       ,mutate_fun = rollapply 
        # rollapply args
       ,width      = 6
       ,align      = "right"
       ,FUN        = mean
        # mean args
       ,na.rm      = TRUE
        # tq_mutate args
       ,col_rename = "mean_6m"
    )


ap_rollmean_sixmonth_tbl %>% glimpse()
## Observations: 144
## Variables: 3
## $ month   <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, J...
## $ value   <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,...
## $ mean_6m <dbl> NA, NA, NA, NA, NA, 124.500, 130.500, 135.500, 136.167, 134...
ap_rollmean_sixmonth_tbl %>% print()
## # A tibble: 144 x 3
##    month         value mean_6m
##    <S3: yearmon> <dbl>   <dbl>
##  1 Jan 1949        112     NA 
##  2 Feb 1949        118     NA 
##  3 Mar 1949        132     NA 
##  4 Apr 1949        129     NA 
##  5 May 1949        121     NA 
##  6 Jun 1949        135    124.
##  7 Jul 1949        148    130.
##  8 Aug 1949        148    136.
##  9 Sep 1949        136    136.
## 10 Oct 1949        119    134.
## # ... with 134 more rows

We compare the two values by plotting the original time series against its moving average.

plot_tbl <- ap_rollmean_sixmonth_tbl %>%
    rename(orig = value) %>%
    gather('label', 'value', -month)


ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = label)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Month') +
    ylab('Passenger Total') +
    ggtitle('Comparison Plot of the Air Passenger Counts')

Note that the moving-average series starts later than the original: the rolling window needs six months of data before it can produce its first value, so the first five entries are NA.

We can add multiple moving averages to a time series by chaining a series of tq_mutate() calls together.

ap_rollmean_multi_tbl <- airpassengers_tbl %>%
    tq_mutate(
        # tq_mutate args
        select     = value
       ,mutate_fun = rollapply
        # rollapply args
       ,width      = 6
       ,align      = "right"
       ,FUN        = mean
        # mean args
       ,na.rm      = TRUE
        # tq_mutate args
       ,col_rename = "mean_6m"
    ) %>%
    tq_mutate(
        # tq_mutate args
        select     = value
       ,mutate_fun = rollapply 
        # rollapply args
       ,width      = 12
       ,align      = "right"
       ,FUN        = mean
        # mean args
       ,na.rm      = TRUE
        # tq_mutate args
       ,col_rename = "mean_12m"
    )
  

ap_rollmean_multi_tbl %>% glimpse()
## Observations: 144
## Variables: 4
## $ month    <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, ...
## $ value    <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118...
## $ mean_6m  <dbl> NA, NA, NA, NA, NA, 124.500, 130.500, 135.500, 136.167, 13...
## $ mean_12m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 126.667, 126.9...
ap_rollmean_multi_tbl %>% print()
## # A tibble: 144 x 4
##    month         value mean_6m mean_12m
##    <S3: yearmon> <dbl>   <dbl>    <dbl>
##  1 Jan 1949        112     NA        NA
##  2 Feb 1949        118     NA        NA
##  3 Mar 1949        132     NA        NA
##  4 Apr 1949        129     NA        NA
##  5 May 1949        121     NA        NA
##  6 Jun 1949        135    124.       NA
##  7 Jul 1949        148    130.       NA
##  8 Aug 1949        148    136.       NA
##  9 Sep 1949        136    136.       NA
## 10 Oct 1949        119    134.       NA
## # ... with 134 more rows

As before, we now create a lineplot of the three values to show the effect of the different window sizes.

plot_tbl <- ap_rollmean_multi_tbl %>%
    rename(orig = value) %>%
    gather('label', 'value', -month)


ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = label)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Month') +
    ylab('Passenger Total') +
    ggtitle('Comparison Plot of the Air Passenger Counts')

The twelve-month series starts later than the six-month series as it has a wider calculation window, so more of its initial entries are NA.

Other windowing functions can also be applied, such as the standard deviation, allowing us to show a band of plausible values around the rolling mean.

ribbon_func <- function(x, na.rm = TRUE, ...) {
    mu    <- mean(x, na.rm = na.rm)
    sigma <- sd(x, na.rm = na.rm)
    
    lower <- mu - 2 * sigma
    upper <- mu + 2 * sigma
    
    return(c(mu = mu, l2sd = lower, u2sd = upper))
}


ap_roll_ribbon_tbl <- airpassengers_tbl %>%
    tq_mutate(
        # tq_mutate args
        select     = value
       ,mutate_fun = rollapply 
        # rollapply args
       ,width      = 6
       ,align      = "right"
       ,by.column  = FALSE
       ,FUN        = ribbon_func
        # mean args
       ,na.rm      = TRUE
    )
  

ap_roll_ribbon_tbl %>% glimpse()
## Observations: 144
## Variables: 5
## $ month <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, Jun...
## $ value <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 1...
## $ mu    <dbl> NA, NA, NA, NA, NA, 124.500, 130.500, 135.500, 136.167, 134.5...
## $ l2sd  <dbl> NA, NA, NA, NA, NA, 106.6674, 109.0058, 114.0058, 114.9472, 1...
## $ u2sd  <dbl> NA, NA, NA, NA, NA, 142.333, 151.994, 156.994, 157.386, 159.6...
ap_roll_ribbon_tbl %>% print()
## # A tibble: 144 x 5
##    month         value    mu  l2sd  u2sd
##    <S3: yearmon> <dbl> <dbl> <dbl> <dbl>
##  1 Jan 1949        112   NA    NA    NA 
##  2 Feb 1949        118   NA    NA    NA 
##  3 Mar 1949        132   NA    NA    NA 
##  4 Apr 1949        129   NA    NA    NA 
##  5 May 1949        121   NA    NA    NA 
##  6 Jun 1949        135  124.  107.  142.
##  7 Jul 1949        148  130.  109.  152.
##  8 Aug 1949        148  136.  114.  157.
##  9 Sep 1949        136  136.  115.  157.
## 10 Oct 1949        119  134.  109.  160.
## # ... with 134 more rows

We now plot the original data against the rolling mean, with a ribbon marking two standard deviations either side of it.

ggplot(ap_roll_ribbon_tbl) +
    geom_line(aes(x = month, y = value)) +
    geom_line(aes(x = month, y = mu), colour = 'red') +
    geom_ribbon(aes(x = month, ymin = l2sd, ymax = u2sd)
               ,colour = 'grey', alpha = 0.25) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Month') +
    ylab('Passenger Total') +
    ggtitle('Ribbon Plot of the Air Passenger Counts (6 month window)')

We now repeat this process using a twelve-month window for the data.

ap_roll_12m_ribbon_tbl <- airpassengers_tbl %>%
    tq_mutate(
        # tq_mutate args
        select     = value
       ,mutate_fun = rollapply 
        # rollapply args
       ,width      = 12
       ,align      = "right"
       ,by.column  = FALSE
       ,FUN        = ribbon_func
        # mean args
       ,na.rm      = TRUE
    )
  

ap_roll_12m_ribbon_tbl %>% glimpse()
## Observations: 144
## Variables: 5
## $ month <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, Jun...
## $ value <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 1...
## $ mu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 126.667, 126.917,...
## $ l2sd  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 99.2264, 100.0100...
## $ u2sd  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 154.107, 153.823,...
ap_roll_12m_ribbon_tbl %>% print()
## # A tibble: 144 x 5
##    month         value    mu  l2sd  u2sd
##    <S3: yearmon> <dbl> <dbl> <dbl> <dbl>
##  1 Jan 1949        112    NA    NA    NA
##  2 Feb 1949        118    NA    NA    NA
##  3 Mar 1949        132    NA    NA    NA
##  4 Apr 1949        129    NA    NA    NA
##  5 May 1949        121    NA    NA    NA
##  6 Jun 1949        135    NA    NA    NA
##  7 Jul 1949        148    NA    NA    NA
##  8 Aug 1949        148    NA    NA    NA
##  9 Sep 1949        136    NA    NA    NA
## 10 Oct 1949        119    NA    NA    NA
## # ... with 134 more rows

Having constructed the data, we once again create a ribbon plot with these quantities.

ggplot(ap_roll_12m_ribbon_tbl) +
    geom_line(aes(x = month, y = value)) +
    geom_line(aes(x = month, y = mu), colour = 'red') +
    geom_ribbon(aes(x = month, ymin = l2sd, ymax = u2sd)
               ,colour = 'grey', alpha = 0.25) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Month') +
    ylab('Passenger Total') +
    ggtitle('Ribbon Plot of the Air Passenger Counts (12 month window)')

2.2.1.1 Exercises

  1. Construct a 3 month moving average for the passenger data and compare it to the 6 and 12 month values.
  2. Calculate the 6 month and 12 month rolling average values for the Maine unemployment data.
  3. Construct the ribbon plot for the Maine unemployment data.
  4. Construct moving average data for the CBE dataset. This process may be made easier by reshaping the data.

2.2.2 Differences

Another common transformation of the data is to take the ‘first differences’ of the values, i.e. we convert the time series of values into one of differences. We discuss the reasons for this later on – for now we focus on the mechanics of creating first differences.

ap_firstdiff_tbl <- airpassengers_tbl %>%
    mutate(diff = value - lag(value, n = 1))


ap_firstdiff_tbl %>% glimpse()
## Observations: 144
## Variables: 3
## $ month <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, Jun...
## $ value <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 1...
## $ diff  <dbl> NA, 6, 14, -3, -8, 14, 13, 0, -12, -17, -15, 14, -3, 11, 15, ...
ap_firstdiff_tbl %>% print()
## # A tibble: 144 x 3
##    month         value  diff
##    <S3: yearmon> <dbl> <dbl>
##  1 Jan 1949        112    NA
##  2 Feb 1949        118     6
##  3 Mar 1949        132    14
##  4 Apr 1949        129    -3
##  5 May 1949        121    -8
##  6 Jun 1949        135    14
##  7 Jul 1949        148    13
##  8 Aug 1949        148     0
##  9 Sep 1949        136   -12
## 10 Oct 1949        119   -17
## # ... with 134 more rows

Having calculated the differences, we now produce a lineplot of those values.

plot_tbl <- ap_firstdiff_tbl %>%
    rename(count = value) %>%
    gather('series', 'value', -month)


ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = series)) +
    scale_y_continuous(labels = comma) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Value') +
    ggtitle('Plot of the Air Passenger Counts and First Differences')

As we see from this plot, the first differences of the passenger data do not contain a trend.

2.2.2.1 Exercises

  1. Calculate the first differences for the Maine unemployment data.
  2. Create a lineplot of these differences to inspect the values.
  3. Calculate the first differences for the CBE data.
  4. Create lineplots for the CBE differences.
  5. Using the lag() function with the Air Passenger data, calculate the percentage changes instead of the arithmetic differences.
  6. Construct the lineplot for the percentage change values.

3 Exploratory Data Analysis of Time Series

The first step in all exploratory analysis is visualisation: simple line plots such as those we have seen are our starting point. The human brain is excellent at pattern recognition, so a simple plot often guides our analysis more effectively than a suite of numerical diagnostics.

Rather than discuss different techniques, we will explore our sample data as a way to illustrate some ways of initially exploring the datasets.

3.1 Air Passenger

We start with the air passenger data, and remind ourselves of the basic structure of the data.

airpassengers_tbl %>% glimpse()
## Observations: 144
## Variables: 2
## $ month <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, Jun...
## $ value <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 1...
airpassengers_tbl %>% print()
## # A tibble: 144 x 2
##    month         value
##    <S3: yearmon> <dbl>
##  1 Jan 1949        112
##  2 Feb 1949        118
##  3 Mar 1949        132
##  4 Apr 1949        129
##  5 May 1949        121
##  6 Jun 1949        135
##  7 Jul 1949        148
##  8 Aug 1949        148
##  9 Sep 1949        136
## 10 Oct 1949        119
## # ... with 134 more rows

3.1.1 Raw Data

We remind ourselves of the lineplot of the raw data.

ggplot(airpassengers_tbl) +
    geom_line(aes(x = month, y = value)) +
    xlab('Date') +
    ylab('Passenger Count') +
    expand_limits(y = 0) +
    ggtitle('Plot of Air Passenger Time Series')

This plot suggests a strong seasonal effect as well as a trend, so these are the first things to investigate.

It may help to add points to the plot to mark individual months, so we add a point at June of each year to help us identify the years.

ap_jun_tbl <- airpassengers_tbl %>% filter(format(month, '%m') == '06')

ggplot(airpassengers_tbl) +
    geom_line(aes(x = month, y = value)) +
    geom_point(aes(x = month, y = value), data = ap_jun_tbl, size = 2) +
    xlab('Date') +
    ylab('Passenger Count') +
    expand_limits(y = 0) +
    ggtitle('Plot of Air Passenger Time Series')

To look into trends, we have a number of options: we could look at yearly sums or averages, or we could look at moving averages.

ggplot(ap_yearly_tidyquant_tbl) +
    geom_line(aes(x = year, y = ann_total)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    ggtitle('Lineplot of the Annual Air Passenger Totals')

One way to investigate the seasonality in the dataset is to construct a boxplot of the passenger counts, grouping by month.

plot_tbl <- airpassengers_tbl %>%
    mutate(cal_month = format(month, '%m'))

ggplot(plot_tbl) +
    geom_boxplot(aes(x = cal_month, y = value)) +
    scale_y_continuous(labels = comma) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Passenger Count') +
    ggtitle('Boxplot of the Air Passenger Counts')

We can see some aspects of the seasonality in this boxplot, but the multiplicative nature of the data means the seasonal pattern is obscured a little.

3.1.2 First Differences

We also produce a similar boxplot, but this time looking at the first differences.

plot_tbl <- ap_firstdiff_tbl %>%
    mutate(cal_month = format(month, '%m'))

ggplot(plot_tbl) +
    geom_boxplot(aes(x = cal_month, y = diff)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Month') +
    ylab('Passenger Count Changes') +
    ggtitle('Boxplot of the First Differences of the Air Passenger Counts')

The seasonality in the data comes through much more strongly in this plot, with much larger monthly effects.

3.1.3 Percentage Changes

To look into multiplicative effects we check the percentage changes from month to month.

ap_perc_change_tbl <- airpassengers_tbl %>%
    mutate(cal_month   = format(month, '%m')
          ,perc_change = value / lag(value, n = 1) - 1)


ap_perc_change_tbl %>% glimpse()
## Observations: 144
## Variables: 4
## $ month       <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 194...
## $ value       <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, ...
## $ cal_month   <chr> "01", "02", "03", "04", "05", "06", "07", "08", "09", "...
## $ perc_change <dbl> NA, 0.0535714, 0.1186441, -0.0227273, -0.0620155, 0.115...
ap_perc_change_tbl %>% print()
## # A tibble: 144 x 4
##    month         value cal_month perc_change
##    <S3: yearmon> <dbl> <chr>           <dbl>
##  1 Jan 1949        112 01            NA     
##  2 Feb 1949        118 02             0.0536
##  3 Mar 1949        132 03             0.119 
##  4 Apr 1949        129 04            -0.0227
##  5 May 1949        121 05            -0.0620
##  6 Jun 1949        135 06             0.116 
##  7 Jul 1949        148 07             0.0963
##  8 Aug 1949        148 08             0     
##  9 Sep 1949        136 09            -0.0811
## 10 Oct 1949        119 10            -0.125 
## # ... with 134 more rows

Having inspected the data and the new column, we now look at some simple lineplots as before.

ggplot(ap_perc_change_tbl) +
    geom_line(aes(x = month, y = perc_change)) +
    expand_limits(y = 0) +
    xlab('Date') +
    ylab('Percentage Change') +
    ggtitle('Lineplot of the Percentage Changes of the Air Passenger Counts')

Much like the arithmetic differences, the percentage changes are centred around zero, so now we can look at a boxplot of them.

ggplot(ap_perc_change_tbl) +
    geom_boxplot(aes(x = cal_month, y = perc_change)) +
    xlab('Month') +
    ylab('Percentage Changes') +
    ggtitle('Boxplot of the Percentage Changes of the Air Passenger Counts')

3.2 Maine Unemployment

We now explore the Maine unemployment data.

As before, we look at the raw data, the first differences and the percentage changes.

maine_explore_tbl <- maine_tbl %>%
    mutate(cal_month   = format(month, '%m')
          ,diff        = value - lag(value, n = 1)
          ,perc_change = value / lag(value, n = 1) - 1
           )


maine_explore_tbl %>% glimpse()
## Observations: 128
## Variables: 5
## $ month       <S3: yearmon> Jan 1996, Feb 1996, Mar 1996, Apr 1996, May 199...
## $ value       <dbl> 6.7, 6.7, 6.4, 5.9, 5.2, 4.8, 4.8, 4.0, 4.2, 4.4, 5.0, ...
## $ cal_month   <chr> "01", "02", "03", "04", "05", "06", "07", "08", "09", "...
## $ diff        <dbl> NA, 0.0, -0.3, -0.5, -0.7, -0.4, 0.0, -0.8, 0.2, 0.2, 0...
## $ perc_change <dbl> NA, 0.0000000, -0.0447761, -0.0781250, -0.1186441, -0.0...
maine_explore_tbl %>% print()
## # A tibble: 128 x 5
##    month         value cal_month    diff perc_change
##    <S3: yearmon> <dbl> <chr>       <dbl>       <dbl>
##  1 Jan 1996        6.7 01         NA         NA     
##  2 Feb 1996        6.7 02          0          0     
##  3 Mar 1996        6.4 03         -0.300     -0.0448
##  4 Apr 1996        5.9 04         -0.5       -0.0781
##  5 May 1996        5.2 05         -0.7       -0.119 
##  6 Jun 1996        4.8 06         -0.4       -0.0769
##  7 Jul 1996        4.8 07          0          0     
##  8 Aug 1996        4   08         -0.800     -0.167 
##  9 Sep 1996        4.2 09          0.2        0.05  
## 10 Oct 1996        4.4 10          0.2        0.0476
## # ... with 118 more rows

3.2.1 Raw Data

We start with the standard lineplot of the values.

ggplot(maine_explore_tbl) +
    geom_line(aes(x = month, y = value)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Time') +
    ylab('Maine Unemployment Count (thousands)') +
    ggtitle('Lineplot of the Maine Unemployment Data')

We do not see a trend in this data, but it does seem to have strong seasonal patterns.

To investigate the seasonality we construct the monthly boxplots from the raw data. Employment is often seasonal in nature - for example, retail is very busy towards Christmas each year. As such, we expect to see a large seasonal component in this data.

ggplot(maine_explore_tbl) +
    geom_boxplot(aes(x = cal_month, y = value)) +
    expand_limits(y = 0) +
    scale_y_continuous(labels = comma) +
    xlab('Month') +
    ylab('Maine Unemployment Count (thousands)') +
    ggtitle('Monthly Boxplot of the Maine Unemployment Data')

We see definite changes over the months, though the effect does not seem as pronounced here as it was for the airline passenger counts.

3.2.2 First Differences

We now look at first differences for the unemployment data, and start with the line plot.

ggplot(maine_explore_tbl) +
    geom_line(aes(x = month, y = diff)) +
    expand_limits(y = 0) +
    xlab('Time') +
    ylab('Maine Unemployment Count Differences') +
    ggtitle('Lineplot of the Differences in Maine Unemployment Data')

The original data does not exhibit a strong trend, and the first differences are similar.

ggplot(maine_explore_tbl) +
    geom_boxplot(aes(x = cal_month, y = diff)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Maine Unemployment Count Differences') +
    ggtitle('Monthly Boxplot of the Maine Unemployment Data')

We see strong seasonal differences in the monthly data.

3.2.3 Percentage Changes

We look at the lineplot of the percentage changes next.

ggplot(maine_explore_tbl) +
    geom_line(aes(x = month, y = perc_change)) +
    expand_limits(y = 0) +
    xlab('Time') +
    ylab('Maine Unemployment Count Percentage Changes') +
    ggtitle('Lineplot of the Percentage Changes in Maine Unemployment Data')

As for the differences, no major trends or patterns emerge from this plot.

We now look for seasonal patterns in the percentage changes.

ggplot(maine_explore_tbl) +
    geom_boxplot(aes(x = cal_month, y = perc_change)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Maine Unemployment Count Percentage Changes') +
    ggtitle('Monthly Boxplot of the Percentage Changes in Maine Unemployment Data')

4 Time Series Decomposition

Many time series are dominated by trends or seasonal effects, and we can create fairly simple models based on these two components. The first of these, the additive decompositional model, is just the sum of these effects, with the residual component being treated as random:

\[ x_t = m_t + s_t + z_t, \]

where, at any given time \(t\),

\[\begin{eqnarray*} x_t && \text{is the observed value} \\ m_t && \text{is the trend} \\ s_t && \text{is the seasonal component} \\ z_t && \text{is the error term} \end{eqnarray*}\]

It is worth noting that, in general, the error terms will be a correlated sequence of values, something we will account for and model later.

In other cases, the seasonal effect may increase as the trend increases, and we then model the values as:

\[ x_t = m_t s_t + z_t. \]

Other options also exist, such as modelling the log of the observed values, which introduces some non-trivial modelling issues, such as biasing any predicted values when they are transformed back to the original scale.
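To see where this bias comes from: if the log of the values is modelled as normal with mean \(\mu\) and variance \(\sigma^2\), then the expected value on the original scale is \(\exp(\mu + \sigma^2/2)\), not \(\exp(\mu)\). A minimal sketch using only base R (simulated data, not the air passenger series) illustrates the effect:

```r
# Back-transforming a model fit on log(x) biases the estimate low.
# If log(x) ~ Normal(mu, sigma^2), then E[x] = exp(mu + sigma^2 / 2).
set.seed(42)
log_x <- rnorm(10000, mean = 5, sd = 0.5)
x     <- exp(log_x)

naive     <- exp(mean(log_x))                    # biased low
corrected <- exp(mean(log_x) + var(log_x) / 2)   # lognormal bias correction

c(true_mean = mean(x), naive = naive, corrected = corrected)
```

The corrected estimate sits much closer to the sample mean than the naive back-transform does.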

Various methods exist for estimating the trend; taking a moving average of the values is a common approach.
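As a minimal sketch of this moving-average approach, using only base R and the built-in AirPassengers series (rather than the tidy tables built earlier): for an even window such as 12 months, the classical trend estimate is a centred moving average with half weights on the two endpoints.

```r
# Classical 2x12 centred moving-average trend estimate for monthly data.
# The filter spans 13 points, with half weights at each end.
ma_weights <- c(0.5, rep(1, 11), 0.5) / 12
trend_est  <- stats::filter(AirPassengers, filter = ma_weights, sides = 2)

# The first and last six values are NA - the window extends past the data.
head(trend_est, 8)
```

This reproduces the trend column that decompose() computes for this series, e.g. 126.792 at July 1949.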

4.1 Simple Decomposition

We start with the simplest decomposition: the simple additive model with a moving average trend.

4.1.1 Additive Models

The decompose() function in R allows us to build this:

ap_ts_sa_decompose <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    decompose()

ap_ts_sa_decompose %>% plot()

The sweep package provides a number of functions to help us tidy the output of these decompositions.

ap_ts_sa_decompose_tbl <- ap_ts_sa_decompose %>%
    sw_tidy_decomp(rename_index = 'month')


ap_ts_sa_decompose_tbl %>% glimpse()
## Observations: 144
## Variables: 6
## $ month    <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, ...
## $ observed <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118...
## $ season   <dbl> -24.74874, -36.18813, -2.24116, -8.03662, -4.50631, 35.402...
## $ trend    <dbl> NA, NA, NA, NA, NA, NA, 126.792, 127.250, 127.958, 128.583...
## $ random   <dbl> NA, NA, NA, NA, NA, NA, -42.62247, -42.07323, -8.47854, 11...
## $ seasadj  <dbl> 136.7487, 154.1881, 134.2412, 137.0366, 125.5063, 99.5972,...
ap_ts_sa_decompose_tbl %>% print()
## # A tibble: 144 x 6
##    month         observed season trend random seasadj
##    <S3: yearmon>    <dbl>  <dbl> <dbl>  <dbl>   <dbl>
##  1 Jan 1949           112 -24.7    NA   NA      137. 
##  2 Feb 1949           118 -36.2    NA   NA      154. 
##  3 Mar 1949           132  -2.24   NA   NA      134. 
##  4 Apr 1949           129  -8.04   NA   NA      137. 
##  5 May 1949           121  -4.51   NA   NA      126. 
##  6 Jun 1949           135  35.4    NA   NA       99.6
##  7 Jul 1949           148  63.8   127. -42.6     84.2
##  8 Aug 1949           148  62.8   127. -42.1     85.2
##  9 Sep 1949           136  16.5   128.  -8.48   119. 
## 10 Oct 1949           119 -20.6   129.  11.1    140. 
## # ... with 134 more rows

We have decomposed the time series into components, and the sw_tidy_decomp() function also calculates the ‘seasonally adjusted’ values, i.e. the observed values with the seasonal component removed, giving the ‘underlying’ series ignoring seasonality.
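For the additive model, the seasonal adjustment is simply the observed value minus the seasonal component. A quick sketch using base R's decompose() on the built-in AirPassengers data (rather than the tidy table above) confirms the identity:

```r
# For an additive decomposition: seasadj = observed - season.
ap_decomp <- decompose(AirPassengers)

seasadj <- AirPassengers - ap_decomp$seasonal

# First value: 112 - (-24.74874) = 136.7487, matching the seasadj column above.
round(seasadj[1], 4)
```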

Comparing the observed and seasonally-adjusted values is straightforward from this: we plot both together.

ggplot(ap_ts_sa_decompose_tbl) +
    geom_line(aes(x = month, y = observed)) +
    geom_line(aes(x = month, y = seasadj), colour = 'red') +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Passenger Count') +
    ggtitle('Comparison Lineplot of Passenger Data and Seasonal Counts'
           ,subtitle = '(seasonal adjustments in red)')

We now want to plot the full decomposition, and the easiest way to do this is to convert the data to ‘long’ format and plot each quantity separately.

plot_tbl <- ap_ts_sa_decompose_tbl %>%
    gather('label', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value)) +
    facet_grid(rows = vars(label)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Value') +
    ggtitle('Decomposition Plot of the Simple Additive Model')

Due to the difference in scales for each component, we redo this plot but free up the \(y\)-axis scale.

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value)) +
    facet_grid(rows = vars(label), scales = 'free_y') +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Value') +
    ggtitle('Decomposition Plot of the Simple Additive Model')

4.1.1.1 Exercises

  1. Construct the simple additive decomposition on the Maine unemployment data.
  2. Construct the simple additive decomposition on the Australian CBE data.

4.1.2 Multiplicative Models

Constructing the multiplicative decomposition model is similar, but this time the seasonal component and error term are multiplied by the trend rather than added to it.

Other than this, the multiplicative model works in a similar fashion. Once again, while we focus on creating plots using ggplot2 we will use the default plotting system as a quick check.

ap_ts_sm_decompose <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    decompose(type = 'multiplicative')

ap_ts_sm_decompose %>% plot()

Having created the decomposition object, we now inspect the outputs and produce plots.

ap_ts_sm_decompose_tbl <- ap_ts_sm_decompose %>%
    sw_tidy_decomp(rename_index = 'month')

ap_ts_sm_decompose_tbl %>% glimpse()
## Observations: 144
## Variables: 6
## $ month    <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, ...
## $ observed <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118...
## $ season   <dbl> 0.910230, 0.883625, 1.007366, 0.975906, 0.981378, 1.112776...
## $ trend    <dbl> NA, NA, NA, NA, NA, NA, 126.792, 127.250, 127.958, 128.583...
## $ random   <dbl> NA, NA, NA, NA, NA, NA, 0.951664, 0.953401, 1.002220, 1.00...
## $ seasadj  <dbl> 123.046, 133.541, 131.035, 132.185, 123.296, 121.318, 120....
ap_ts_sm_decompose_tbl %>% print()
## # A tibble: 144 x 6
##    month         observed season trend random seasadj
##    <S3: yearmon>    <dbl>  <dbl> <dbl>  <dbl>   <dbl>
##  1 Jan 1949           112  0.910   NA  NA        123.
##  2 Feb 1949           118  0.884   NA  NA        134.
##  3 Mar 1949           132  1.01    NA  NA        131.
##  4 Apr 1949           129  0.976   NA  NA        132.
##  5 May 1949           121  0.981   NA  NA        123.
##  6 Jun 1949           135  1.11    NA  NA        121.
##  7 Jul 1949           148  1.23   127.  0.952    121.
##  8 Aug 1949           148  1.22   127.  0.953    121.
##  9 Sep 1949           136  1.06   128.  1.00     128.
## 10 Oct 1949           119  0.922  129.  1.00     129.
## # ... with 134 more rows

While the model is different from before, it produces similar outputs, and so our plot code is almost identical to before.

We compare the observed values to the seasonally-adjusted ones first via a lineplot.

ggplot(ap_ts_sm_decompose_tbl) +
    geom_line(aes(x = month, y = observed)) +
    geom_line(aes(x = month, y = seasadj), colour = 'red') +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Passenger Count') +
    ggtitle('Comparison Lineplot of Passenger Data and Seasonal Counts (Multiplicative Model)'
           ,subtitle = '(seasonal adjustments in red)')

We then look at the full decomposition plot, and each of the components.

plot_tbl <- ap_ts_sm_decompose_tbl %>%
    gather('label', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value)) +
    facet_grid(rows = vars(label)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Value') +
    ggtitle('Decomposition Plot of the Simple Multiplicative Model for the Air Passenger Data')

As with the additive model, the components of the decomposition are on different scales, so we redo the plot with a free \(y\)-axis.

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value)) +
    facet_grid(rows = vars(label), scales = 'free_y') +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Value') +
    ggtitle('Decomposition Plot of the Simple Multiplicative Model for the Air Passenger Data')

4.1.2.1 Exercises

  1. Construct the simple multiplicative decomposition on the Maine unemployment data.
  2. Construct the simple multiplicative decomposition on the Australian CBE data.

4.1.3 Drawbacks of Decomposition

The above approaches to time-series decomposition have the benefit of simplicity, both in terms of intuition and calculation, but come with a number of drawbacks:

  • Estimates of the trend are not available at the start and end of the time-series.
  • The trend-cycle estimate tends to over-smooth rapid rises and falls in the data - we can check for this by inspecting the remainder.
  • Seasonal components are fixed throughout the dataset and do not change - an assumption much too simplistic for real-world data, especially data collected over long periods of time.
  • Idiosyncratic events causing short-term changes in behaviour are not handled well.
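The first drawback is easy to demonstrate directly. A quick sketch using base R and the built-in AirPassengers data:

```r
# The moving-average trend from decompose() is NA for the first and last
# half-window - six months on each side for monthly data.
ap_trend <- decompose(AirPassengers)$trend

sum(is.na(head(ap_trend, 6)))   # no trend estimate at the start
sum(is.na(tail(ap_trend, 6)))   # none at the end either
```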

4.2 STL Decomposition

The estimation of trend in a time series is often referred to as smoothing, as it tries to remove higher-frequency noise to reveal the underlying signal.

These smoothing procedures often use data points both before and after the time at which the smoothed estimate is to be calculated. This makes such estimates unsuitable for direct forecasting, but they can nevertheless reveal interesting structure in the data.

Previously we looked at using moving averages to estimate the trend, but an alternative approach is Seasonal and Trend decomposition using Loess (STL). This estimates the trend using locally-weighted regression (loess) methods.

4.2.1 Periodic STL

Our first use of STL involves using a regression window equal to the size of the period of the data. For AirPassengers, this is 12, but we do not have to set this explicitly.

ap_stl_decompose <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    magrittr::extract(,1) %>%       # need to use this for compatibility with stl()
    stl(s.window = 'periodic')

ap_stl_decompose %>% plot()

As before, we can use functions in the sweep package to create tidy outputs to ease plotting and analysis.

ap_stl_decompose_tbl <- ap_stl_decompose %>%
    sw_tidy_decomp(rename_index = 'month')

ap_stl_decompose_tbl %>% glimpse()
## Observations: 144
## Variables: 6
## $ month     <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949,...
## $ observed  <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 11...
## $ season    <dbl> -25.49772, -35.22093, -3.02748, -8.29905, -5.73729, 32.33...
## $ trend     <dbl> 127.187, 126.650, 126.112, 126.199, 126.286, 126.733, 127...
## $ remainder <dbl> 10.310370, 26.571403, 8.915762, 11.100122, 0.451141, -24....
## $ seasadj   <dbl> 137.4977, 153.2209, 135.0275, 137.2991, 126.7373, 102.663...
ap_stl_decompose_tbl %>% print()
## # A tibble: 144 x 6
##    month         observed season trend remainder seasadj
##    <S3: yearmon>    <dbl>  <dbl> <dbl>     <dbl>   <dbl>
##  1 Jan 1949          112. -25.5   127.    10.3     137. 
##  2 Feb 1949          118  -35.2   127.    26.6     153. 
##  3 Mar 1949          132   -3.03  126.     8.92    135. 
##  4 Apr 1949          129   -8.30  126.    11.1     137. 
##  5 May 1949          121   -5.74  126.     0.451   127. 
##  6 Jun 1949          135   32.3   127.   -24.1     103. 
##  7 Jul 1949          148   70.2   127.   -49.4      77.8
##  8 Aug 1949          148   68.0   127.   -47.5      80.0
##  9 Sep 1949          136   17.4   128.    -9.09    119. 
## 10 Oct 1949          119. -21.1   129.    11.0     140. 
## # ... with 134 more rows

This decomposition seems similar to the one produced by the decompose() function, so we should compare the output. This is where tidy output is useful: we use standard dplyr and tidyr methodologies to manipulate the data and produce plots for visual inspection.

decomp_tbl <- ap_ts_sa_decompose_tbl %>%
    rename(remainder = random)

stl_tbl <- ap_stl_decompose_tbl


decomp_compare_tbl <- list(
    `Simple Additive` = decomp_tbl
   ,`STL Periodic`    = stl_tbl
    ) %>%
    bind_rows(.id = 'category')

decomp_compare_tbl %>% glimpse()
## Observations: 288
## Variables: 7
## $ category  <chr> "Simple Additive", "Simple Additive", "Simple Additive", ...
## $ month     <dbl> 1949.00, 1949.08, 1949.17, 1949.25, 1949.33, 1949.42, 194...
## $ observed  <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 11...
## $ season    <dbl> -24.74874, -36.18813, -2.24116, -8.03662, -4.50631, 35.40...
## $ trend     <dbl> NA, NA, NA, NA, NA, NA, 126.792, 127.250, 127.958, 128.58...
## $ remainder <dbl> NA, NA, NA, NA, NA, NA, -42.62247, -42.07323, -8.47854, 1...
## $ seasadj   <dbl> 136.7487, 154.1881, 134.2412, 137.0366, 125.5063, 99.5972...
decomp_compare_tbl %>% print()
## # A tibble: 288 x 7
##    category        month observed season trend remainder seasadj
##    <chr>           <dbl>    <dbl>  <dbl> <dbl>     <dbl>   <dbl>
##  1 Simple Additive 1949       112 -24.7    NA      NA      137. 
##  2 Simple Additive 1949.      118 -36.2    NA      NA      154. 
##  3 Simple Additive 1949.      132  -2.24   NA      NA      134. 
##  4 Simple Additive 1949.      129  -8.04   NA      NA      137. 
##  5 Simple Additive 1949.      121  -4.51   NA      NA      126. 
##  6 Simple Additive 1949.      135  35.4    NA      NA       99.6
##  7 Simple Additive 1950.      148  63.8   127.    -42.6     84.2
##  8 Simple Additive 1950.      148  62.8   127.    -42.1     85.2
##  9 Simple Additive 1950.      136  16.5   128.     -8.48   119. 
## 10 Simple Additive 1950.      119 -20.6   129.     11.1    140. 
## # ... with 278 more rows
ggplot(decomp_compare_tbl) +
    geom_line(aes(x = month, y = season, colour = category)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Seasonal Value') +
    ggtitle('Comparison Plot for Seasonal Values between Decompositional Methods')

The seasonal values are almost the same, so we now compare the trend values.

ggplot(decomp_compare_tbl) +
    geom_line(aes(x = month, y = trend, colour = category)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Trend Value') +
    ggtitle('Comparison Plot for Trends between Decompositional Methods')

We see here that STL with a periodic smoothing window produces almost the same decomposition as the standard decompose() function for the Air Passenger data.

4.2.2 Robust STL

We can also use robust loess to perform the STL, and compare the results.

ap_stl_robust_decompose <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    magrittr::extract(,1) %>%       # need to use this for compatibility with stl()
    stl(s.window = 'periodic', robust = TRUE)

ap_stl_robust_decompose %>% plot()

As before, we can use functions in the sweep package to create tidy outputs to ease plotting and analysis.

ap_stl_robust_decompose_tbl <- ap_stl_robust_decompose %>%
    sw_tidy_decomp(rename_index = 'month')

ap_stl_robust_decompose_tbl %>% glimpse()
## Observations: 144
## Variables: 6
## $ month     <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949,...
## $ observed  <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 11...
## $ season    <dbl> -16.519819, -27.337882, 9.009778, 0.992387, 3.081231, 20....
## $ trend     <dbl> 123.186, 123.421, 123.657, 124.137, 124.616, 125.216, 125...
## $ remainder <dbl> 5.3341624, 21.9164399, -0.6670047, 3.8711054, -6.6970183,...
## $ seasadj   <dbl> 128.520, 145.338, 122.990, 128.008, 117.919, 114.997, 114...
ap_stl_robust_decompose_tbl %>% print()
## # A tibble: 144 x 6
##    month         observed  season trend remainder seasadj
##    <S3: yearmon>    <dbl>   <dbl> <dbl>     <dbl>   <dbl>
##  1 Jan 1949          112  -16.5    123.     5.33     129.
##  2 Feb 1949          118. -27.3    123.    21.9      145.
##  3 Mar 1949          132    9.01   124.    -0.667    123.
##  4 Apr 1949          129    0.992  124.     3.87     128.
##  5 May 1949          121    3.08   125.    -6.70     118.
##  6 Jun 1949          135   20.0    125.   -10.2      115.
##  7 Jul 1949          148   33.4    126.   -11.2      115.
##  8 Aug 1949          148   40.1    126.   -18.4      108.
##  9 Sep 1949          136   22.2    127.   -13.1      114.
## 10 Oct 1949          119. -13.6    128.     4.82     133.
## # ... with 134 more rows

4.2.3 Decomposition Comparisons

We now directly compare all the decompositions.

decomp_compare_tbl <- list(
    `Simple Additive` = decomp_tbl
   ,`STL Periodic`    = stl_tbl
   ,`STL Robust`      = ap_stl_robust_decompose_tbl
    ) %>%
    bind_rows(.id = 'category')

decomp_compare_tbl %>% glimpse()
## Observations: 432
## Variables: 7
## $ category  <chr> "Simple Additive", "Simple Additive", "Simple Additive", ...
## $ month     <dbl> 1949.00, 1949.08, 1949.17, 1949.25, 1949.33, 1949.42, 194...
## $ observed  <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 11...
## $ season    <dbl> -24.74874, -36.18813, -2.24116, -8.03662, -4.50631, 35.40...
## $ trend     <dbl> NA, NA, NA, NA, NA, NA, 126.792, 127.250, 127.958, 128.58...
## $ remainder <dbl> NA, NA, NA, NA, NA, NA, -42.62247, -42.07323, -8.47854, 1...
## $ seasadj   <dbl> 136.7487, 154.1881, 134.2412, 137.0366, 125.5063, 99.5972...
decomp_compare_tbl %>% print()
## # A tibble: 432 x 7
##    category        month observed season trend remainder seasadj
##    <chr>           <dbl>    <dbl>  <dbl> <dbl>     <dbl>   <dbl>
##  1 Simple Additive 1949       112 -24.7    NA      NA      137. 
##  2 Simple Additive 1949.      118 -36.2    NA      NA      154. 
##  3 Simple Additive 1949.      132  -2.24   NA      NA      134. 
##  4 Simple Additive 1949.      129  -8.04   NA      NA      137. 
##  5 Simple Additive 1949.      121  -4.51   NA      NA      126. 
##  6 Simple Additive 1949.      135  35.4    NA      NA       99.6
##  7 Simple Additive 1950.      148  63.8   127.    -42.6     84.2
##  8 Simple Additive 1950.      148  62.8   127.    -42.1     85.2
##  9 Simple Additive 1950.      136  16.5   128.     -8.48   119. 
## 10 Simple Additive 1950.      119 -20.6   129.     11.1    140. 
## # ... with 422 more rows

We have constructed the comparison dataset and now plot the seasonal values for the different decompositions.

ggplot(decomp_compare_tbl) +
    geom_line(aes(x = month, y = season, colour = category)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Seasonal Value') +
    ggtitle('Comparison Plot for Seasonal Values between Decompositional Methods')

We now look at trend values for the different decompositions.

ggplot(decomp_compare_tbl) +
    geom_line(aes(x = month, y = trend, colour = category)) +
    expand_limits(y = 0) +
    xlab('Month') +
    ylab('Trend Value') +
    ggtitle('Comparison Plot for Trends between Decompositional Methods')

4.3 Additional Decompositions

Many more approaches to time-series decomposition exist, but a thorough exploration of the techniques is beyond the scope of this workshop. Furthermore, while a number of excellent packages exist in R to perform this analysis, at the time of this workshop, they are not compatible with tidyquant.

That said, some of these are very useful, so we will introduce them and show some simple uses of them, and leave further exploration as an exercise.

4.3.1 X11 Decomposition

The X11 method was developed by the US Census Bureau and Statistics Canada to assist in the collection and analysis of their official statistical data.

Building upon the approaches discussed already, this method allows for seasonal effects that change over time, differing day counts in each month, and other wrinkles such as holidays.

The seasonal package provides an implementation of this decomposition, and while we do not have a tidyquant interface for it, the forecast package provides some simple plotting routines.

ap_x11_decompose <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    seas(x11 = '')

ap_x11_decompose %>% plot()

ap_x11_decompose %>%
    autoplot() + ggtitle("Output of X11 Decomposition on Air Passenger Data")

4.3.2 SEATS Decomposition

“SEATS” stands for Seasonal Extraction in ARIMA Time Series. Developed by the Bank of Spain, this approach is commonly used by government agencies but only works with quarterly and monthly data - data with seasonal effects at other frequencies will require a different method.

As we did with X11, we will show the output for the default and leave its exploration as an exercise.

ap_seats_decompose <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    seas()

ap_seats_decompose %>% plot()

ap_seats_decompose %>%
    autoplot() + ggtitle("Output of SEATS Decomposition on Air Passenger Data")

5 Time-Series Structure

Assuming we can remove the trend and the seasonal variation, we still have the random component, \(z_t\). Up to now, we have modelled this random component as a sequence of random variables.

To simplify the analysis, we often make assumptions such as the random components forming a sequence of independent and identically distributed (i.i.d.) random variables, but this is rarely effective.

Most of the time, the \(z_t\) are correlated and so we need to put some structure on our framework for modelling this.

To that end, we now look at some statistical concepts and quantities that are used to help with this modelling piece.

5.1 Random Variables

The expected value or expectation of a random variable \(x\), denoted \(E(x)\), is the mean value of \(x\) in the population. Thus, for a continuous \(x\), we have

\[ \mu = E(x) = \int p(x) \, x \, dx. \]

and the variance, \(\sigma^2\), is the expectation of the squared deviations,

\[ \text{Var}(x) = \sigma^2 = E[(x - \mu)^2], \]

For bivariate data, each datapoint can be represented as \((x, y)\) and we can generalise this concept to the covariance, \(\gamma(x, y)\),

\[ \gamma(x, y) = \text{Cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]. \]

Correlation, \(\rho\), is standardised covariance – we scale the covariance by the standard deviation of the two variables,

\[ \rho(x, y) = \frac{\gamma(x, y)}{\sigma_x \sigma_y}. \]
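As a quick numerical sanity check (a minimal base-R sketch, separate from the workshop pipeline), we can confirm that dividing the covariance by the two standard deviations recovers the built-in correlation:

```r
# Correlation as standardised covariance: rho = cov(x, y) / (sd_x * sd_y)
set.seed(42)
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)

manual_rho <- cov(x, y) / (sd(x) * sd(y))

# this matches the built-in correlation function
manual_rho
cor(x, y)
```

Since cov(), sd() and cor() all use the same \(n - 1\) denominator, the two values agree exactly.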

The mean function of a time series model is

\[ \mu(t) = E(x_t), \]

with the expectation being taken across the ensemble of possible histories of time series produced by this model. In general, we only have one realisation of this model, and so, without any further assumptions, we estimate the mean to be the measured value.

If the mean function is constant, we say that the time-series is stationary in the mean, and the estimate of the population mean is just the sample mean,

\[ \mu = \frac{1}{n} \sum^n_{t=1} x_t. \]

The variance function of a time-series model that is stationary in the mean is given by

\[ \sigma^2(t) = E[(x_t - \mu)^2]. \]

If we make the further assumption that the time-series is also stationary in the variance, then the population variance is just the sample variance:

\[ \text{Var}(x) = \frac{\sum(x_t - \mu)^2}{n - 1} \]

5.1.1 Exercises

  1. Are any of the time series we have analysed stationary in the mean?
  2. What about first differences?
  3. What about percentage changes?

5.2 Autocorrelation (Serial Correlation)

Autocorrelation, also known as serial correlation, is the correlation between random variables at different time intervals in a time series. We define the lag, \(k\), to be a fixed discrete interval for analysis. From this, we define the autocovariance function and the autocorrelation function as functions of \(k\):

\[\begin{eqnarray*} \gamma_k &=& E[(x_t - \mu)(x_{t+k} - \mu)], \\ \rho_k &=& \frac{\gamma_k}{\sigma^2}. \end{eqnarray*}\]

In R, the acf() function plots the correlogram, the plot of the sample autocorrelation \(\rho_k\) against the lag \(k\).
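Before building the correlogram by hand below, it may help to see the core calculation in a minimal base-R sketch: acf() computes the lag-\(k\) autocovariance using deviations from the single overall mean, with denominator \(n\) at every lag:

```r
# Lag-k sample autocorrelation, following acf()'s convention:
# deviations from the overall mean, denominator n at every lag
x <- as.numeric(AirPassengers)
n <- length(x)
k <- 1

gamma_k <- sum((x[1:(n - k)] - mean(x)) * (x[(k + 1):n] - mean(x))) / n
gamma_0 <- sum((x - mean(x))^2) / n
rho_k   <- gamma_k / gamma_0

# agrees with acf()'s lag-1 value ($acf[1] is lag 0)
rho_k
as.numeric(acf(x, lag.max = 1, plot = FALSE)$acf)[2]
```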

5.2.1 Analysing the AirPassenger Data

We now turn our attention to calculating autocorrelations for a given time series, starting with the Air Passenger data.

Autocorrelations and the correlogram are one of the most important concepts and tools in all of time series analysis, so while routines do exist to perform the calculations and produce the plots, we will first go through the entire process manually to ensure we have a thorough understanding of it all.

NOTE: Due to internal differences in how correlations are calculated (cor() uses a different denominator to acf()) - our two methods will produce slightly different values. In all cases, the values produced by acf() and functions it depends on are the ones to trust.

Once we have gone through this process once, we will use the more direct procedures.

ap_lags_tbl <- airpassengers_tbl %>%
    tq_mutate(
        select     = value
       ,mutate_fun = lag.xts
       ,k          = 1:24
       ,col_rename = paste0('l', 1:24)
    ) %>%
    gather('lag_k','lag_val', -month, -value) %>%
    mutate(lag_k = lag_k %>% str_replace('^l', '') %>% as.integer())

ap_lags_tbl %>% glimpse()
## Observations: 3,456
## Variables: 4
## $ month   <S3: yearmon> Jan 1949, Feb 1949, Mar 1949, Apr 1949, May 1949, J...
## $ value   <dbl> 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,...
## $ lag_k   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ lag_val <dbl> NA, 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, ...
ap_lags_tbl %>% print()
## # A tibble: 3,456 x 4
##    month         value lag_k lag_val
##    <S3: yearmon> <dbl> <int>   <dbl>
##  1 Jan 1949        112     1      NA
##  2 Feb 1949        118     1     112
##  3 Mar 1949        132     1     118
##  4 Apr 1949        129     1     132
##  5 May 1949        121     1     129
##  6 Jun 1949        135     1     121
##  7 Jul 1949        148     1     135
##  8 Aug 1949        148     1     148
##  9 Sep 1949        136     1     148
## 10 Oct 1949        119     1     136
## # ... with 3,446 more rows

To calculate the autocorrelations at various lags, we group by the lag amount and calculate the correlation where we have values (we do not want to include the NA values in the calculation).

Note that even independent pairs of numbers are likely to show a small amount of correlation due to sampling noise. The threshold below which correlations are indistinguishable from noise can be shown to be

\[ \text{Threshold} = \frac{2}{\sqrt{N}} \]
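To see this threshold in action, we can simulate pure white noise, which has no true autocorrelation, and check that its sample autocorrelations mostly stay inside the band (a small illustrative sketch; the seed is arbitrary):

```r
# Sample autocorrelations of white noise should mostly fall
# inside the +/- 2 / sqrt(N) noise band
set.seed(101)
N  <- 144
wn <- rnorm(N)

rho  <- as.numeric(acf(wn, lag.max = 20, plot = FALSE)$acf)[-1]  # drop lag 0
band <- 2 / sqrt(N)

mean(abs(rho) <= band)  # proportion of lags inside the band (expect ~95%)
```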

We include these cutoff bounds in the calculation.

ap_autocorrs_tbl <- ap_lags_tbl %>%
    group_by(lag_k) %>%
    summarise(
        cor          = cor(x = value, y = lag_val, use = "pairwise.complete.obs")
       ,cutoff_upper =  2 / sqrt(n())
       ,cutoff_lower = -2 / sqrt(n())
    )

ap_autocorrs_tbl %>% glimpse()
## Observations: 24
## Variables: 4
## $ lag_k        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ cor          <dbl> 0.960195, 0.895675, 0.837395, 0.797735, 0.785943, 0.78...
## $ cutoff_upper <dbl> 0.166667, 0.166667, 0.166667, 0.166667, 0.166667, 0.16...
## $ cutoff_lower <dbl> -0.166667, -0.166667, -0.166667, -0.166667, -0.166667,...
ap_autocorrs_tbl %>% print()
## # A tibble: 24 x 4
##    lag_k   cor cutoff_upper cutoff_lower
##    <int> <dbl>        <dbl>        <dbl>
##  1     1 0.960        0.167       -0.167
##  2     2 0.896        0.167       -0.167
##  3     3 0.837        0.167       -0.167
##  4     4 0.798        0.167       -0.167
##  5     5 0.786        0.167       -0.167
##  6     6 0.784        0.167       -0.167
##  7     7 0.785        0.167       -0.167
##  8     8 0.792        0.167       -0.167
##  9     9 0.828        0.167       -0.167
## 10    10 0.883        0.167       -0.167
## # ... with 14 more rows

Now that we have the data in the appropriate form, it is straight-forward to construct a plot in ggplot2 to visualise the data.

ggplot(ap_autocorrs_tbl) +
    geom_line(aes(x = lag_k, y = cor)) +
    geom_line(aes(x = lag_k, y = cutoff_upper), colour = 'blue', linetype = 'dashed') +
    geom_line(aes(x = lag_k, y = cutoff_lower), colour = 'blue', linetype = 'dashed') +
    xlab('Lag') +
    ylab('Auto-correlation') +
    ggtitle('Correlogram for the Air Passenger Data')

As mentioned above, we can calculate these values directly in R by using acf() and its more user-friendly function from forecast, Acf().

ap_acf <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    Acf()

ap_acf %>% autoplot() + ggtitle('Correlogram of the Air Passenger Data')

5.2.2 Partial Autocorrelation

The correlogram is useful for checking correlations at various lags, but due to the nature of the calculation, the autocorrelation at higher lags is related to those at lower lags.

For this reason, we often look at partial autocorrelations - where the correlation due to earlier lags is removed. The R code for calculating the PACF is almost identical to that for the ACF.

ap_pacf <- airpassengers_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    Pacf()

ap_pacf %>% autoplot() + ggtitle('Partial Correlogram of the Air Passenger Data')

We see the two main correlations are at lag 1 and lag 13, manifesting the strong annual seasonality in the data.

Partial autocorrelations are especially useful for modelling autoregressive models, and we will return to this later.

5.3 Multivariate Covariance

Multivariate series have temporal equivalents of correlation and covariance, known as the cross-covariance function (ccvf) and the cross-correlation function (ccf),

\[\begin{eqnarray} \gamma_k(x, y) &=& E[(x_{t+k} - \mu_x)(y_t - \mu_y)], \\ \rho_k(x, y) &=& \frac{\gamma_k(x, y)}{\sigma_x \sigma_y}. \end{eqnarray}\]

Note that the above functions are not symmetric, as the lag is always on the first variable, \(x\).

A corollary of this definition is that negative lags switch the order of \((x, y)\)

\[ \gamma_k(x, y) = \gamma_{-k}(y, x) \]
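We can verify this identity numerically with base R's ccf() (an illustrative sketch with arbitrary random series): swapping the two series reverses the lag axis, so reversing the output of ccf(y, x) reproduces ccf(x, y).

```r
# Check gamma_k(x, y) = gamma_{-k}(y, x) with base R's ccf()
set.seed(7)
x <- rnorm(50)
y <- rnorm(50)

cc_xy <- as.numeric(ccf(x, y, lag.max = 5, plot = FALSE)$acf)
cc_yx <- as.numeric(ccf(y, x, lag.max = 5, plot = FALSE)$acf)

# swapping the series reverses the lag axis
all.equal(cc_xy, rev(cc_yx))
```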

5.3.1 Building Activity

Before we look at the data, we might expect to have a good example of leading and lagging variables in the Australian Building data: one of the values is the number of construction approvals, the other is the amount of activity.

With this data we might expect that the behaviour in approvals will reflect later behaviour in activity - there is a natural chain of causality there.

So, we now have something to investigate, and the code functionality we know can extend to multivariate time series.

approvactiv_cross_acf <- approvactiv_ts %>%
    Acf(plot = FALSE)

approvactiv_cross_acf %>%
    autoplot() +
        ggtitle('Cross-correlations of the Australian Building Data')

Note that these plots are not symmetric: the plots on the diagonal give us the univariate correlograms, whereas plot (2,1) on the bottom left shows us the cross-correlation of approvals to activity, i.e. whether or not the approval values lead activity values.

We see quite a few lags are above the ‘noise’ threshold, indicating there is a leading effect.

The plot (1,2) on the top right is the reverse: do activity values lead approval values? Here we see little evidence of this.

5.3.2 Cross-Correlation and Non-Stationary Series

In general, cross-correlation is defined for stationary time series: as non-stationary series may have trends and seasonal effects, these can dominate the calculation. As a result, we typically remove trends and seasonality before looking for cross-correlations.

To illustrate this, we do a simple exercise of producing independent series with trends and seasonality and calculate cross-correlations.

produce_trend_series <- function(strength, trend = 1:100, N = 100) {
    x <- trend + strength * rnorm(N)
    y <- trend + strength * rnorm(N)
    
    data_tbl <- tibble(
        i = seq_along(trend)
       ,x = x
       ,y = y
    )
    
    return(data_tbl)
}

gen_data_tbl <- tibble(strength = c(1, 10, 100, 1000)) %>%
    mutate(data = map(strength, produce_trend_series)) %>%
    unnest()

ggplot(gen_data_tbl) +
    geom_line(aes(x = i, y = x)) +
    geom_line(aes(x = i, y = y), colour = 'red') +
    facet_wrap(~strength, ncol = 2, scale = 'free_y')

Having looked at the series, we see that for low values of strength the trend dominates the value. We now construct the cross-correlation function for each of these bivariate series.

create_ccf_plot <- function(ccf_data, strength) {
    autoplot(ccf_data) +
        ggtitle(paste('Strength', strength))
}

gen_data_tbl %>%
    group_by(strength) %>%
    summarise(ccf_data = list(Ccf(x, y, plot = FALSE))) %>%
    mutate(ccf_plot = map2(ccf_data, strength, create_ccf_plot)) %>%
    pull(ccf_plot) %>%
    plot_grid(plotlist = ., ncol = 2)

As we see, we pick up a lot of cross-correlation with a strong trend, but this drops off for the series with weaker trends.

Similarly, we look at seasonal data, constructing some toy bivariate series with seasonal effects and then look at the cross-correlations.

produce_seasonal_series <- function(strength, offset = 4, N = 370) {
    time <- 0:N

    x <- sin(2 * pi * time / 37)            + strength * rnorm(N+1)
    y <- sin(2 * pi * (time + offset) / 37) + strength * rnorm(N+1)
    
    data_tbl <- tibble(
        i = seq_along(time)
       ,x = x
       ,y = y
    )
    
    return(data_tbl)
}

gen_data_tbl <- tibble(strength = c(0, 0.1, 0.5, 1, 2, 10)) %>%
    mutate(data = map(strength, produce_seasonal_series)) %>%
    unnest()

ggplot(gen_data_tbl) +
    geom_line(aes(x = i, y = x)) +
    geom_line(aes(x = i, y = y), colour = 'red') +
    facet_wrap(~ strength, ncol = 3, scale = 'free_y')

create_ccf_plot <- function(ccf_data, strength) {
    autoplot(ccf_data) +
        ggtitle(paste('Strength', strength))
}

gen_data_tbl %>%
    group_by(strength) %>%
    summarise(ccf_data = list(Ccf(x, y, plot = FALSE))) %>%
    mutate(ccf_plot = map2(ccf_data, strength, create_ccf_plot)) %>%
    pull(ccf_plot) %>%
    plot_grid(plotlist = ., ncol = 3)

6 Basic Forecasting

The basic task of forecasting is the prediction of future values for some quantity.

One effective method of forecasting is to find a related variable whose value leads it by one or more timesteps. This related variable is often called a leading indicator.

A good example of a leading indicator is the ADP numbers in the US. ADP is the main provider of payroll software in the US, and it releases information in advance of US Unemployment statistics. Economic and financial analysts pay close attention to this quantity as it is predictive of the unemployment numbers.

In quantitative trading, one of the first popular and profitable uses of statistical and time series analysis was in pairs trading - closely related stocks tend to move together and this analysis enabled profitable trades.

So, for any given quantity, the trick is to find a leading variable, but such a variable is often unreliable or non-existent. As a result, we start with the task of using univariate analysis to make our predictions.

6.1 Forecasting with Linear Models

One possible starting point for forecasting problems is to use a linear model, much like any other predictive modelling with time as an input to the regression. We account for seasonality by using the various calendar labels as additional predictors.

Rather than fit the model on the whole dataset, we use the final year of data as our test dataset.

ap_lm_data_tbl <- airpassengers_tbl %>%
    mutate(month_label = format(month, '%m')
          ,year_num    = year(month)
           )

ap_train_tbl <- ap_lm_data_tbl %>% filter(year_num != 1960)
ap_test_tbl  <- ap_lm_data_tbl %>% filter(year_num == 1960)
maine_train_tbl <- maine_tbl %>% filter(month <  as.yearmon('2005-01'))
maine_test_tbl  <- maine_tbl %>% filter(month >= as.yearmon('2005-01'))

6.1.1 Simple Linear Model

We now fit a model.

ap_base_lm <- lm(value ~ 0 + month_label + year_num, data = ap_train_tbl)

ap_base_lm %>% glance()
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.994         0.993  23.5     1459. 1.60e-124    13  -597. 1223. 1263.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
ap_base_lm %>% tidy()
## # A tibble: 13 x 5
##    term          estimate std.error statistic  p.value
##    <chr>            <dbl>     <dbl>     <dbl>    <dbl>
##  1 month_label01 -59991.   1267.        -47.4 4.49e-79
##  2 month_label02 -59996.   1267.        -47.4 4.45e-79
##  3 month_label03 -59961.   1267.        -47.3 4.76e-79
##  4 month_label04 -59968.   1267.        -47.3 4.69e-79
##  5 month_label05 -59964.   1267.        -47.3 4.73e-79
##  6 month_label06 -59926.   1267.        -47.3 5.08e-79
##  7 month_label07 -59891.   1267.        -47.3 5.43e-79
##  8 month_label08 -59889.   1267.        -47.3 5.44e-79
##  9 month_label09 -59934.   1267.        -47.3 5.01e-79
## 10 month_label10 -59968.   1267.        -47.3 4.69e-79
## 11 month_label11 -59999.   1267.        -47.4 4.43e-79
## 12 month_label12 -59971.   1267.        -47.4 4.67e-79
## 13 year_num          30.8     0.648      47.5 2.93e-79

We now check the fitted values against the observed.

plot_tbl <- ap_base_lm %>%
    augment() %>%
    mutate(month = as.Date(paste(year_num, month_label, '01', sep = '-'))) %>%
    select(month, observed = value, fitted = .fitted) %>%
    gather('series', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = series)) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle("Comparison Lineplot of Air Passengers vs Linear Model")

We see that the additive model here may be less appropriate than a multiplicative model, as our linear model over-estimates early values and then under-estimates towards the end of the series.

We expect this will continue when we try the model with the test data.

plot_tbl <- ap_base_lm %>%
    augment(newdata = ap_test_tbl) %>%
    mutate(month = as.Date(paste(year_num, month_label, '01', sep = '-'))) %>%
    select(month, observed = value, fitted = .fitted) %>%
    gather('series', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = series)) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle("Comparison Lineplot of Air Passengers vs Linear Model (1960 points)")

6.1.2 Poisson Model

A more appropriate model may be to fit the Poisson model.

ap_base_glm <- glm(value ~ 0 + month_label + year_num
                  ,data   = ap_train_tbl
                  ,family = poisson)

ap_base_glm %>% glance()
## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1       322545.     132  -532. 1091. 1128.     97.6         119
ap_base_glm %>% tidy()
## # A tibble: 13 x 5
##    term          estimate std.error statistic p.value
##    <chr>            <dbl>     <dbl>     <dbl>   <dbl>
##  1 month_label01 -231.      3.47        -66.5       0
##  2 month_label02 -231.      3.47        -66.5       0
##  3 month_label03 -231.      3.47        -66.5       0
##  4 month_label04 -231.      3.47        -66.5       0
##  5 month_label05 -231.      3.47        -66.5       0
##  6 month_label06 -231.      3.47        -66.5       0
##  7 month_label07 -230.      3.47        -66.4       0
##  8 month_label08 -230.      3.47        -66.4       0
##  9 month_label09 -231.      3.47        -66.5       0
## 10 month_label10 -231.      3.47        -66.5       0
## 11 month_label11 -231.      3.47        -66.5       0
## 12 month_label12 -231.      3.47        -66.5       0
## 13 year_num         0.121   0.00177      68.1       0

We now check the fitted values against the observed.

plot_tbl <- ap_base_glm %>%
    augment(type.predict = 'response') %>%
    mutate(month = as.Date(paste(year_num, month_label, '01', sep = '-'))) %>%
    select(month, observed = value, fitted = .fitted) %>%
    gather('series', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = series)) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle("Comparison Lineplot of Air Passengers vs Poisson Model")

We see that a Poisson model does a better job of capturing the magnitude of the values throughout the series. We hope this means the 1960 data points are better matched also.

plot_tbl <- ap_base_glm %>%
    augment(newdata = ap_test_tbl, type.predict = 'response') %>%
    mutate(month = as.Date(paste(year_num, month_label, '01', sep = '-'))) %>%
    select(month, observed = value, fitted = .fitted) %>%
    gather('series', 'value', -month)

ggplot(plot_tbl) +
    geom_line(aes(x = month, y = value, colour = series)) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle("Comparison Lineplot of Air Passengers vs Poisson Model (1960 points)")

6.2 Time Series Forecasting

Our main objective in forecasting is to estimate the value of a future quantity, \(x_{n+k}\), given past values \({x_1, x_2, ..., x_n}\). We assume no seasonal or trend effects, or that such effects are removed from the data.

We assume that the underlying mean of the data is \(\mu_t\), and that the value \(x_t\) changes from timestep to timestep, but this change is random.

Our model can be expressed as

\[ x_t = \mu_t + w_t, \]

where

\[\begin{eqnarray*} \mu_t &=& \text{non-stationary mean of the process at time } t \\ w_t &\sim& \mathcal{N}(0, \sigma) \text{ (i.i.d)} \end{eqnarray*}\]
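To make this model concrete, here is a minimal base-R simulation (a sketch with arbitrary parameters): the mean \(\mu_t\) drifts as a random walk, and each observation adds i.i.d. noise \(w_t\).

```r
# Simulate x_t = mu_t + w_t with a slowly-drifting, non-stationary mean
set.seed(1234)
n  <- 200
mu <- cumsum(rnorm(n, sd = 0.5))  # mu_t follows a random walk
w  <- rnorm(n)                    # w_t ~ N(0, 1), i.i.d.
x  <- mu + w

plot(ts(x), main = 'Simulated Series with Drifting Mean')
```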

6.2.1 EWMA Models

We let \(a_t\) be our estimate of \(\mu_t\), and define the exponentially-weighted moving average (EWMA), \(a_t\), to be

\[ a_t = \alpha x_t + (1 - \alpha) \, a_{t-1}, \;\;\; 0 \leq \alpha \leq 1. \]

The value of \(\alpha\) controls the amount of smoothing, and is referred to as the smoothing parameter. An alternative name for this time-series model is simple exponential smoothing (SES).
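Before handing this to a package, the recursion itself is easy to sketch in base R (initialising the level at \(x_1\) is one common convention, not necessarily the one ets() uses):

```r
# EWMA / simple exponential smoothing:
# a_t = alpha * x_t + (1 - alpha) * a_{t-1}
ewma <- function(x, alpha) {
    a    <- numeric(length(x))
    a[1] <- x[1]                   # initialise the level at x_1
    for (t in 2:length(x)) {
        a[t] <- alpha * x[t] + (1 - alpha) * a[t - 1]
    }
    a
}

x <- as.numeric(AirPassengers)
a <- ewma(x, alpha = 0.3)

# for this model, every future forecast is just the final level
tail(a, 1)
```

Note that with \(\alpha = 1\) the recursion simply reproduces the series itself - no smoothing at all.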

The main package we use for forecasting in R is the forecast package, created by Rob Hyndman. forecast contains a convenient framework for performing most of the tasks required for time-series forecasting. We then use the sweep package to bring the outputs into a tidy format where appropriate.

6.2.1.1 Air Passenger Data

We start using the EWMA model on the Air Passenger data.

As a simple approach for testing our models, we remove the final year from the air passenger data and produce forecasts for those, comparing them to the actual values.

ap_ewma_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    ets(model = 'ANN')

ap_ewma_fit %>% sw_tidy()
## # A tibble: 2 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 alpha    1.000
## 2 l      112.
ap_ewma_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE   MPE  MAPE  MASE  ACF1
##   <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(A,N,N)  31.5  -776. 1559. 1568.  2.22  31.2  23.9 0.413  8.91 0.785 0.286

Now that we have fit our model we can use the forecast() function to produce forecasts for future times, as well as plotting them.

ap_ewma_forecast_tbl <- ap_ewma_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')


ggplot(ap_ewma_forecast_tbl) +
    geom_line  (aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Forecast for Air Passengers with EWMA Model')

We see that the EWMA model is not effective in terms of prediction.

We can compare the fits against the data directly with sw_augment():

ap_ewma_augment_tbl <- ap_ewma_fit %>%
    sw_augment(rename_index = 'month')


ggplot(ap_ewma_augment_tbl) +
    geom_line(aes(x = month, y = .actual)) +
    geom_line(aes(x = month, y = .fitted), colour = 'red') +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Forecast for Air Passengers with EWMA Model')

6.2.1.2 Maine Unemployment

We now try the EWMA model on the Maine Unemployment data to see if this model works any better.

maine_ewma_fit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    ets(model = 'ANN')

maine_ewma_fit %>% sw_tidy()
## # A tibble: 2 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 alpha    1.000
## 2 l        6.70
maine_ewma_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC      ME  RMSE   MAE    MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 ETS(A,N,N) 0.442  -164.  333.  341. -0.0194 0.438 0.329 -0.812  7.52 0.625
## # ... with 1 more variable: ACF1 <dbl>

As before, having fit the EWMA model we now produce forecasts.

maine_ewma_forecast_tbl <- maine_ewma_fit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')


ggplot(maine_ewma_forecast_tbl) +
    geom_line  (aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Forecast for Maine Unemployment Data with EWMA Model')

In some cases, with such a simple model we may prefer to set values for \(\alpha\) directly rather than fitting it to the data. We do this by passing the chosen parameters directly into the fitting routine:

maine_ewma_paramfit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    ets(model = 'ANN', alpha = 0.8)

maine_ewma_paramfit %>% sw_tidy()
## # A tibble: 2 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 l         6.68
## 2 alpha     0.8
maine_ewma_paramfit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC      ME  RMSE   MAE   MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(A,N,N) 0.470  -170.  347.  355. -0.0243 0.466 0.353 -1.07  8.06 0.672
## # ... with 1 more variable: ACF1 <dbl>

We now produce forecasts using the set EWMA parameter to see how they compare.

maine_ewma_paramfit_forecast_tbl <- maine_ewma_paramfit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')


ggplot(maine_ewma_paramfit_forecast_tbl) +
    geom_line  (aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Forecast for Maine Unemployment Data with Non-Fitted EWMA Model')

6.2.2 Holt-Winters Model

The Holt-Winters method generalises this concept, allowing for trends and seasonal effects. The equations that govern this model for seasonal period, \(p\), are given by

\[\begin{eqnarray*} a_t &=& \alpha (x_t - s_{t-p}) + (1 - \alpha) (a_{t-1} + b_{t-1}),\\ b_t &=& \beta (a_t - a_{t-1}) + (1 - \beta) b_{t-1},\\ s_t &=& \gamma (x_t - a_t) + (1 - \gamma) s_{t-p}, \end{eqnarray*}\]

where

\[\begin{eqnarray*} a_t && \text{is the estimated level at time $t$},\\ b_t && \text{is the estimated slope at time $t$},\\ s_t && \text{is the estimated seasonal effect at time $t$},\\ \alpha, \beta, \gamma && \text{are smoothing parameters} \end{eqnarray*}\]
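A single update of these recursions can be sketched directly in base R (the state and parameter values below are made up purely for illustration; note the level update combines the previous level and slope, \(a_{t-1} + b_{t-1}\)):

```r
# One step of the additive Holt-Winters recursions
holt_winters_step <- function(x_t, a_prev, b_prev, s_prev_p,
                              alpha, beta, gamma) {
    a_t <- alpha * (x_t - s_prev_p) + (1 - alpha) * (a_prev + b_prev)
    b_t <- beta  * (a_t - a_prev)   + (1 - beta)  * b_prev
    s_t <- gamma * (x_t - a_t)      + (1 - gamma) * s_prev_p
    list(level = a_t, slope = b_t, season = s_t)
}

# one illustrative update with made-up state values
st <- holt_winters_step(x_t = 148, a_prev = 126, b_prev = 1.5,
                        s_prev_p = 20, alpha = 0.5, beta = 0.1, gamma = 0.2)
st
```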

Within the framework of the forecast package, we have an additive trend with an additive seasonal component and additive errors. This corresponds to using the ‘AAA’ ets() model.

6.2.2.1 Air Passenger Data

We start with the Air Passenger data, and use the ‘AAA’ model within ets().

ap_hw_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    ets(model = 'AAA')

ap_hw_fit %>% sw_tidy()
## # A tibble: 16 x 2
##    term     estimate
##    <chr>       <dbl>
##  1 alpha    1.000   
##  2 beta     0.000266
##  3 gamma    0.000323
##  4 l      121.      
##  5 b        1.60    
##  6 s0     -26.8     
##  7 s1     -51.0     
##  8 s2     -21.3     
##  9 s3      14.7     
## 10 s4      59.5     
## 11 s5      59.5     
## 12 s6      33.4     
## 13 s7      -6.52    
## 14 s8      -7.52    
## 15 s9       2.45    
## 16 s10    -32.2
ap_hw_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE   MPE  MAPE  MASE  ACF1
##   <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(A,A,A)  16.4  -683. 1401. 1450. 0.747  15.4  11.6 0.259  5.00 0.380 0.164
ap_hw_forecast_tbl <- ap_hw_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')


ggplot(ap_hw_forecast_tbl) +
    geom_line  (aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Forecast for Air Passengers with Holt-Winters Model')

ap_hw_augment_tbl <- ap_hw_fit %>%
    sw_augment(rename_index = 'month')

ggplot(ap_hw_augment_tbl) +
    geom_line(aes(x = month, y = .actual)) +
    geom_line(aes(x = month, y = .fitted), colour = 'red') +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('sw_augment Output for Holt-Winters Model')

For this data a purely additive model is not the most appropriate, so we will refit using multiplicative errors and seasonality.

Rather than run through the full set of code again we run a more concise version of this code.

ap_hwmult_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    ets(model = 'MAM')

ap_hwmult_forecast_tbl <- ap_hwmult_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

ap_hwmult_fit %>% sw_tidy()
## # A tibble: 17 x 2
##    term    estimate
##    <chr>      <dbl>
##  1 alpha   0.758   
##  2 beta    0.0213  
##  3 gamma   0.000106
##  4 phi     0.980   
##  5 l     121.      
##  6 b       1.76    
##  7 s0      0.897   
##  8 s1      0.798   
##  9 s2      0.919   
## 10 s3      1.06    
## 11 s4      1.22    
## 12 s5      1.23    
## 13 s6      1.11    
## 14 s7      0.978   
## 15 s8      0.980   
## 16 s9      1.02    
## 17 s10     0.893
ap_hwmult_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc  sigma logLik   AIC   BIC    ME  RMSE   MAE   MPE  MAPE  MASE
##   <chr>       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(M,Ad,~ 0.0378  -604. 1244. 1296.  1.51  9.35  6.98 0.442  2.76 0.229
## # ... with 1 more variable: ACF1 <dbl>
ggplot(ap_hwmult_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Air Passengers Forecast for Multiplicative Holt-Winters Model')

6.2.2.2 Maine Unemployment Data

We now try to fit our models with the Maine Unemployment data, repeating the process from before.

maine_hw_fit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    ets(model = 'ANN')

maine_hw_forecast_tbl <- maine_hw_fit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

maine_hw_fit %>% sw_tidy()
## # A tibble: 2 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 alpha    1.000
## 2 l        6.70
maine_hw_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC      ME  RMSE   MAE    MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 ETS(A,N,N) 0.442  -164.  333.  341. -0.0194 0.438 0.329 -0.812  7.52 0.625
## # ... with 1 more variable: ACF1 <dbl>
ggplot(maine_hw_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Additive Holt-Winters Model for the Maine Unemployment Data')

As before, we can try a multiplicative model.

maine_hwmult_fit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    ets(model = 'MAM')

maine_hwmult_forecast_tbl <- maine_hwmult_fit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

maine_hwmult_fit %>% sw_tidy()
## # A tibble: 17 x 2
##    term  estimate
##    <chr>    <dbl>
##  1 alpha   0.648 
##  2 beta    0.141 
##  3 gamma   0.0497
##  4 phi     0.807 
##  5 l       5.80  
##  6 b      -0.327 
##  7 s0      0.968 
##  8 s1      0.978 
##  9 s2      0.885 
## 10 s3      0.834 
## 11 s4      0.822 
## 12 s5      0.915 
## 13 s6      0.961 
## 14 s7      0.948 
## 15 s8      1.09  
## 16 s9      1.17  
## 17 s10     1.22
maine_hwmult_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc  sigma logLik   AIC   BIC      ME  RMSE   MAE      MPE  MAPE  MASE
##   <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
## 1 ETS(M,Ad,~ 0.0403  -55.0  146.  194. 0.00129 0.169 0.127 -1.08e-5  2.87 0.241
## # ... with 1 more variable: ACF1 <dbl>
ggplot(maine_hwmult_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Multiplicative Holt-Winters Model for the Maine Unemployment Data')

We see that including a seasonal component in the model improves the predictions.

6.3 Automated Model Selection

The ets() function allows us to automatically select a model given the time series data.

Because the exponentially-smoothed models are fitted by maximum likelihood, we can use AIC and BIC as decision metrics.

The Akaike Information Criterion (AIC) is defined to be:

\[ \text{AIC} = -2 \log(L) + 2k, \]

where \(L\) is the likelihood of the fitted model and \(k\) is the number of degrees of freedom in the model (free parameters, initial states and residual variance).

For small sample sizes this approach may still be prone to overfitting, so we may use a version corrected for small-sample bias, \(\text{AIC}_c\):

\[ \text{AIC}_c = \text{AIC} + \frac{2k(k+1)}{T - k - 1}. \]

The Bayes Information Criterion (BIC) is defined to be:

\[ \text{BIC} = \text{AIC} + k (\log(T) - 2) \]
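These definitions can be checked by hand in base R. The sketch below (using only the stats package that ships with R) recomputes AIC and BIC from the log-likelihood of an ARIMA fit to the AirPassengers data and compares them against the built-in AIC() and BIC() functions.

```r
# Recompute AIC and BIC from the definitions above and compare them with
# R's built-in AIC() and BIC() for a simple ARIMA fit.
fit <- arima(AirPassengers, order = c(1, 0, 0))

k <- length(coef(fit)) + 1       # free parameters plus the residual variance
n <- length(AirPassengers)       # T, the number of observations

aic_manual <- -2 * as.numeric(logLik(fit)) + 2 * k
bic_manual <- aic_manual + k * (log(n) - 2)

all.equal(aic_manual, AIC(fit))  # TRUE
all.equal(bic_manual, BIC(fit))  # TRUE
```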

The ets() function allows us to specify the type of error term, trend and seasonality, choosing additive (A), multiplicative (M), none (N) or automatically selected (Z) for each of the three components.

The first value, the error type, cannot be 'N', and a number of combinations are numerically unstable and so are suppressed.

We now fit both the Air Passenger and Maine Unemployment data using the automatic model selection.

6.3.1 Air Passenger Models

We now fit the Air Passenger data first.

ap_autoets_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    ets(model = 'ZZZ')

ap_autoets_forecast_tbl <- ap_autoets_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

ap_autoets_fit %>% sw_tidy()
## # A tibble: 17 x 2
##    term    estimate
##    <chr>      <dbl>
##  1 alpha   0.758   
##  2 beta    0.0213  
##  3 gamma   0.000106
##  4 phi     0.980   
##  5 l     121.      
##  6 b       1.76    
##  7 s0      0.897   
##  8 s1      0.798   
##  9 s2      0.919   
## 10 s3      1.06    
## 11 s4      1.22    
## 12 s5      1.23    
## 13 s6      1.11    
## 14 s7      0.978   
## 15 s8      0.980   
## 16 s9      1.02    
## 17 s10     0.893
ap_autoets_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc  sigma logLik   AIC   BIC    ME  RMSE   MAE   MPE  MAPE  MASE
##   <chr>       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(M,Ad,~ 0.0378  -604. 1244. 1296.  1.51  9.35  6.98 0.442  2.76 0.229
## # ... with 1 more variable: ACF1 <dbl>
ggplot(ap_autoets_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Air Passengers Forecast for Automatic ETS Model')

Having fit the data and selected the model, we now can compare our different models and how the forecasts compare to each other.

ap_forecast_compare_tbl <- list(
    ewma   = ap_ewma_forecast_tbl    %>% filter(key == 'forecast')
   ,hw     = ap_hw_forecast_tbl      %>% filter(key == 'forecast')
   ,hwmult = ap_hwmult_forecast_tbl  %>% filter(key == 'forecast')
   ,auto   = ap_autoets_forecast_tbl %>% filter(key == 'forecast')
    ) %>%
    bind_rows(.id = 'model') %>%
    mutate(month = as.yearmon(month))


ggplot(ap_forecast_compare_tbl) +
    geom_line(aes(x = month, y = value, colour = model)) +
    geom_line(aes(x = month, y = value), colour = 'black', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Comparison of Air Passenger Forecast Models')

6.3.2 Maine Unemployment Models

We now move on to the Maine Unemployment data, and repeat the procedure.

maine_autoets_fit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    ets(model = 'ZZZ')

maine_autoets_forecast_tbl <- maine_autoets_fit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

maine_autoets_fit %>% sw_tidy()
## # A tibble: 17 x 2
##    term  estimate
##    <chr>    <dbl>
##  1 alpha   0.648 
##  2 beta    0.141 
##  3 gamma   0.0497
##  4 phi     0.807 
##  5 l       5.80  
##  6 b      -0.327 
##  7 s0      0.968 
##  8 s1      0.978 
##  9 s2      0.885 
## 10 s3      0.834 
## 11 s4      0.822 
## 12 s5      0.915 
## 13 s6      0.961 
## 14 s7      0.948 
## 15 s8      1.09  
## 16 s9      1.17  
## 17 s10     1.22
maine_autoets_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc  sigma logLik   AIC   BIC      ME  RMSE   MAE      MPE  MAPE  MASE
##   <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
## 1 ETS(M,Ad,~ 0.0403  -55.0  146.  194. 0.00129 0.169 0.127 -1.08e-5  2.87 0.241
## # ... with 1 more variable: ACF1 <dbl>
ggplot(maine_autoets_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Maine Unemployment Forecast for Automatic ETS Model')

As before, we compare the forecasts against one another.

maine_forecast_compare_tbl <- list(
    ewma      = maine_ewma_forecast_tbl          %>% filter(key == 'forecast')
   ,ewmaparam = maine_ewma_paramfit_forecast_tbl %>% filter(key == 'forecast')
   ,hw        = maine_hw_forecast_tbl            %>% filter(key == 'forecast')
   ,hwmult    = maine_hwmult_forecast_tbl        %>% filter(key == 'forecast')
   ,auto      = maine_autoets_forecast_tbl       %>% filter(key == 'forecast')
    ) %>%
    bind_rows(.id = 'model') %>%
    mutate(month = as.yearmon(month))


ggplot(maine_forecast_compare_tbl) +
    geom_line(aes(x = month, y = value, colour = model)) +
    geom_line(aes(x = month, y = value), colour = 'black', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Comparison of Maine Unemployment Forecast Models')

7 AR, MA and ARMA Models

We now turn our attention to autoregressive and moving-average models - the second workhorse of time-series modelling.

7.1 Building Blocks

These models are built from a number of elementary model types that we first discuss before combining in different ways to model time series data seen in the real world.

7.1.1 White Noise Series

A time series \(w_t\) is white noise if the \(w_t\) are i.i.d. with a mean of zero. They all have the same variance \(\sigma^2\) and \(\text{Cor}(w_i, w_j) = 0\) for \(i \neq j\).

In addition, if \(w_j \sim N(0, \sigma^2)\) then it is called Gaussian white noise.

A time series \(x_t\) is a random walk if

\[ x_t = x_{t-1} + w_t, \]

where \(w_t\) is a white-noise series.
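Before turning to arima.sim(), note that a random walk can be generated directly: by the recursion above it is simply the cumulative sum of its white-noise increments. A minimal base-R sketch:

```r
# x_t = x_{t-1} + w_t  implies  x_t = w_1 + w_2 + ... + w_t,
# so cumsum() of the innovations generates the random walk directly.
set.seed(42)
w <- rnorm(200)   # white-noise increments
x <- cumsum(w)    # random walk

# Differencing recovers the increments (up to floating-point error),
# which is why differencing a random walk yields white noise.
isTRUE(all.equal(diff(x), w[-1]))  # TRUE
```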

wn_innov <- rnorm(200, 0, 1)

sim_wn_ts <- arima.sim(list()
                      ,n = length(wn_innov)
                      ,innov = wn_innov
                      ,start.innov = rep(0, 100)
                      )

sim_wn_ts_tbl <- sim_wn_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_wn_ts), .before = 1)

ggplot(sim_wn_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated Gaussian White Noise Series")

We also produce standard and partial correlograms, and expect no significant lags.

sim_wn_ts %>% Acf()  %>% autoplot() + ggtitle("ACF for White Noise Series")

sim_wn_ts %>% Pacf() %>% autoplot() + ggtitle("PACF for White Noise Series")
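The dashed significance bounds drawn in these correlograms sit at approximately \(\pm 1.96 / \sqrt{n}\), the 95% limits for the sample autocorrelations of a white-noise series; a quick base-R calculation shows what that works out to for our 200 observations.

```r
# Under the white-noise null, the sample autocorrelation r_k is approximately
# N(0, 1/n), so the 95% significance bound is +/- 1.96 / sqrt(n).
n <- 200
bound <- 1.96 / sqrt(n)
round(bound, 3)   # 0.139
```

Any sample autocorrelation inside these bounds is consistent with white noise, so the occasional marginal spike in the plots above is not evidence of real structure.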

7.1.2 Autoregressive Process - AR(p)

The time series \(x_t\) is an autoregressive process of order \(p\), \(\text{AR}(p)\), if

\[\begin{eqnarray*} x_t &=& \delta + \alpha_1 x_{t-1} + ... + \alpha_p x_{t-p} + w_t \\ &=& \delta + \sum^p_{i=1} \alpha_i x_{t-i} + w_t, \end{eqnarray*}\]

where

\[\begin{eqnarray*} w_t &=& \text{a white-noise process} \\ \delta &=& \text{the 'intercept' value} \\ \alpha_p &\neq& 0 \text{ for order } p \end{eqnarray*}\]
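The defining recursion can also be run by hand. The sketch below (base R only) generates an AR(1) with \(\alpha_1 = 0.5\) by direct recursion and confirms it matches stats::filter() with method = "recursive", which implements the same recurrence.

```r
# AR(1) by direct recursion: x_t = 0.5 * x_{t-1} + w_t, starting from zero.
set.seed(1)
w <- rnorm(100)

x <- numeric(100)
x[1] <- w[1]
for (t in 2:100) x[t] <- 0.5 * x[t - 1] + w[t]

# stats::filter() with method = "recursive" applies the same recursion.
x_filter <- as.numeric(stats::filter(w, filter = 0.5, method = "recursive"))
isTRUE(all.equal(x, x_filter))  # TRUE
```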

With this definition, we can use this as a data-generating process to construct an example AR(1) model and visually inspect the results.

sim_ar1_ts <- arima.sim(list(ar = 0.5)
                       ,n     = length(wn_innov)
                       ,innov = wn_innov
                       ,start.innov = rep(0, 100)
                        )

sim_ar1_ts_tbl <- sim_ar1_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_ar1_ts), .before = 1)

ggplot(sim_ar1_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated AR(1) Series")

Having generated the AR(1) process, we now look at the correlogram to see what it produces.

sim_ar1_ts %>% Acf()  %>% autoplot() + ggtitle("ACF for AR(1) Series")

sim_ar1_ts %>% Pacf() %>% autoplot() + ggtitle("PACF for AR(1) Series")

Moving on, in a similar fashion we can construct an AR(2) model with the same innovations, but with \(\alpha_1 = 1\), \(\alpha_2 = -0.25\):

sim_ar2_ts <- arima.sim(list(ar = c(1, -0.25))
                       ,n     = length(wn_innov)
                       ,innov = wn_innov
                       ,start.innov = rep(0, 100)
                         )

sim_ar2_ts_tbl <- sim_ar2_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_ar2_ts), .before = 1)


ggplot(sim_ar2_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated AR(2) Series")

We also look at the correlogram for this series.

sim_ar2_ts %>% Acf()  %>% autoplot() + ggtitle("ACF for AR(2) Series")

sim_ar2_ts %>% Pacf() %>% autoplot() + ggtitle("PACF for AR(2) Series")

We can try a number of different parameters here to see the effect - we try another AR(2) but this time with both parameters positive:

sim_ar2alt_ts <- arima.sim(list(ar = c(0.5, 0.4))
                          ,n     = length(wn_innov)
                          ,innov = wn_innov
                          ,start.innov = rep(0, 100)
                           )

sim_ar2alt_ts_tbl <- sim_ar2alt_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_ar2alt_ts), .before = 1)


ggplot(sim_ar2alt_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated Alternate AR(2) Series")

And the correlogram for this alternate AR(2) series.

sim_ar2alt_ts %>% Acf()  %>% autoplot() + ggtitle("ACF for Alternate AR(2) Series")

sim_ar2alt_ts %>% Pacf() %>% autoplot() + ggtitle("PACF for Alternate AR(2) Series")

To help with intuition we plot these generated data together.

sim_compare_tbl <- list(
    wn     = sim_wn_ts_tbl
   ,ar1    = sim_ar1_ts_tbl
   ,ar2    = sim_ar2_ts_tbl
   ,ar2alt = sim_ar2alt_ts_tbl
    ) %>%
    bind_rows(.id = 'model')


ggplot(sim_compare_tbl) +
    geom_line(aes(x = index, y = value, colour = model)) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Generated AR Models Comparison Plot")

7.1.3 Moving Average Process - MA(q)

A moving average (MA) process of order \(q\) is a linear combination of the current white noise term and the \(q\) most recent past white noise terms,

\[\begin{eqnarray*} x_t &=& \gamma + w_t + \beta_1 w_{t-1} + ... + \beta_q w_{t-q} \\ &=& \gamma + w_t + \sum^q_{i=1} \beta_i w_{t-i} \end{eqnarray*}\]

where

\[\begin{eqnarray*} w_t &=& \text{a white-noise process with mean 0 and variance } \sigma^2 \\ \gamma &=& \text{the 'intercept' value} \\ \beta_q &\neq& 0 \text{ for order } q \end{eqnarray*}\]
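Because an MA(q) process is a finite linear combination of the innovations, it can be computed directly with a one-sided convolution filter. A base-R sketch for an MA(1) with \(\beta_1 = 0.5\):

```r
# MA(1) by hand: x_t = w_t + 0.5 * w_{t-1}, via a one-sided convolution filter.
set.seed(1)
w <- rnorm(100)
x <- as.numeric(stats::filter(w, filter = c(1, 0.5),
                              method = "convolution", sides = 1))

# The first value is NA since w_0 is unavailable; the rest match the definition.
isTRUE(all.equal(x[-1], w[-1] + 0.5 * w[-length(w)]))  # TRUE
```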

sim_ma1_ts <- arima.sim(list(ma = 0.5)
                       ,n     = length(wn_innov)
                       ,innov = wn_innov
                       ,start.innov = rep(0, 100)
                        )

sim_ma1_ts_tbl <- sim_ma1_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_ma1_ts), .before = 1)

ggplot(sim_ma1_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated MA(1) Series")

We check the MA(1) correlogram:

sim_ma1_ts %>% Acf()  %>% autoplot() + ggtitle("ACF for MA(1) Series")

sim_ma1_ts %>% Pacf() %>% autoplot() + ggtitle("PACF for MA(1) Series")

Like we did before, we now generate an MA(2) model.

sim_ma2_ts <- arima.sim(list(ma = c(1, -0.25))
                       ,n     = length(wn_innov)
                       ,innov = wn_innov
                       ,start.innov = rep(0, 100)
                         )

sim_ma2_ts_tbl <- sim_ma2_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_ma2_ts), .before = 1)


ggplot(sim_ma2_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated MA(2) Series")

Once again we look at the MA(2) series.

sim_ma2_ts %>% Acf()  %>% autoplot() + ggtitle("ACF for MA(2) Series")

sim_ma2_ts %>% Pacf() %>% autoplot() + ggtitle("PACF for MA(2) Series")

sim_compare_tbl <- list(
    wn  = sim_wn_ts_tbl
   ,ma1 = sim_ma1_ts_tbl
   ,ma2 = sim_ma2_ts_tbl
    ) %>%
    bind_rows(.id = 'model')


ggplot(sim_compare_tbl) +
    geom_line(aes(x = index, y = value, colour = model)) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Generated MA Models Comparison Plot")

7.1.4 Comparing AR and MA Models

Before moving on to ARMA and ARIMA models we combine all our generated series and plot them all against each other.

sim_compare_tbl <- list(
    wn     = sim_wn_ts_tbl
   ,ar1    = sim_ar1_ts_tbl
   ,ar2    = sim_ar2_ts_tbl
   ,ar2alt = sim_ar2alt_ts_tbl
   ,ma1    = sim_ma1_ts_tbl
   ,ma2    = sim_ma2_ts_tbl
    ) %>%
    bind_rows(.id = 'model')


ggplot(sim_compare_tbl) +
    geom_line(aes(x = index, y = value, colour = model)) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Generated AR and MA Models Comparison Plot")

7.2 ARMA and ARIMA Models

Autoregressive moving-average models combine the AR and MA models. The benefit of combining them comes from the principle of parsimony: given two models of equal quality, we prefer the one with fewer parameters.

In general, ARMA models fit time series with fewer parameters than either AR or MA models alone, so they are preferred.

More formally, we define an \(\text{ARMA}(p, q)\) model to be

\[ x_t = \sum_{i=1}^p \alpha_i x_{t-i} + w_t + \sum_{j=1}^q \beta_j w_{t-j} \]

where \(w_t\) is white noise.

Both \(\text{AR}(p)\) and \(\text{MA}(q)\) models are special cases of \(\text{ARMA}(p, q)\) (with \(q = 0\) and \(p = 0\) respectively).

We now generate an \(\text{ARMA}(1, 1)\) series using the same white-noise innovations as before.

sim_arma_1_1_ts <- arima.sim(list(ar = 0.5, ma = 0.5)
                            ,n     = length(wn_innov)
                            ,innov = wn_innov
                            ,start.innov = rep(0, 100)
                             )

sim_arma_1_1_ts_tbl <- sim_arma_1_1_ts %>%
    tk_tbl() %>%
    add_column(index = 1:length(sim_arma_1_1_ts), .before = 1)

ggplot(sim_arma_1_1_ts_tbl) +
    geom_line(aes(x = index, y = value)) +
    expand_limits(y = 0) +
    xlab("Index") +
    ylab("Value") +
    ggtitle("Plot of Generated ARMA(1,1) Series")

We use auto.arima() to fit this data, to see whether we recover the parameters used in the simulation.

sim_arma_1_1_ts %>%
    auto.arima() %>%
    sw_tidy()
## # A tibble: 2 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 ar1      0.545
## 2 ma1      0.550

7.2.1 ARIMA Models

As discussed earlier, it is often more effective to model the differences in a time series rather than the raw values.

A variant of ARMA is the Auto-Regressive Integrated Moving Average model, which applies the ARMA model to the differences of the series. ARIMA models are specified with a third order value, \(d\) - the order of differencing - and so are described as \(\text{ARIMA}(p, d, q)\) models.

While there is no theoretical limit to \(d\), in practice it is uncommon to see \(d > 1\).
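To make the "ARMA on the differences" idea concrete, the base-R sketch below fits an ARIMA(1, 1, 1) to the AirPassengers series and compares it with an ARMA(1, 1) fitted to the first differences; the coefficient estimates should agree closely, since the two specifications describe the same model.

```r
# ARIMA(1,1,1) on x is ARMA(1,1) applied to diff(x), with no mean term
# (arima() drops the mean automatically whenever d > 0).
fit_arima <- arima(AirPassengers, order = c(1, 1, 1))
fit_diff  <- arima(diff(AirPassengers), order = c(1, 0, 1),
                   include.mean = FALSE)

coef(fit_arima)
coef(fit_diff)   # ar1 and ma1 estimates agree closely
```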

7.3 Fitting ARMA and ARIMA Models

We now turn our attention to fitting ARMA and ARIMA models.

In some cases, we use a combination of domain knowledge and autocorrelation analysis to determine the order of the AR(I)MA model we fit, but generally we go with the best fit as determined by AIC or BIC (and ensure this ‘best’ fit is still useful and makes sense).
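A crude version of this search can be written by hand: fit a small grid of candidate orders and keep the one with the lowest AIC. This base-R sketch mirrors, very roughly, what auto.arima() does with a smarter stepwise search.

```r
# Manual order selection: fit ARIMA(p, 1, q) for a small grid of (p, q)
# and keep the model with the lowest AIC. Failed fits are scored as Inf.
grid <- expand.grid(p = 0:2, q = 0:2)

grid$aic <- mapply(function(p, q) {
    fit <- tryCatch(arima(AirPassengers, order = c(p, 1, q)),
                    error = function(e) NULL)
    if (is.null(fit)) Inf else AIC(fit)
}, grid$p, grid$q)

grid[which.min(grid$aic), ]   # the best (p, q) pair by AIC
```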

7.3.1 Fitting the Air Passenger Data

As always, we start with the Air Passenger data.

Purely for illustration, we start with the ARIMA(1, 0, 1) model - which is equivalent to the ARMA(1, 1) - and then try the ARIMA(1, 1, 1).

ap_arma_1_1_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    Arima(order = c(1, 0, 1))

ap_arma_1_1_forecast_tbl <- ap_arma_1_1_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

ap_arma_1_1_fit %>% sw_tidy()
## # A tibble: 3 x 2
##   term      estimate
##   <chr>        <dbl>
## 1 ar1          0.934
## 2 ma1          0.399
## 3 intercept  264.
ap_arma_1_1_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE    MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 ARIMA(1,0~  29.5  -634. 1276. 1287.  1.33  29.1  22.9 -0.731  8.88 0.753
## # ... with 1 more variable: ACF1 <dbl>
ggplot(ap_arma_1_1_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Air Passenger Forecast for ARMA(1, 1) Model')

This is not a good model.

ap_arma_1_1_1_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    Arima(order = c(1, 1, 1))

ap_arma_1_1_1_forecast_tbl <- ap_arma_1_1_1_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

ap_arma_1_1_1_fit %>% sw_tidy()
## # A tibble: 2 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 ar1     -0.544
## 2 ma1      0.927
ap_arma_1_1_1_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE   MPE  MAPE  MASE   ACF1
##   <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA(1,1~  28.8  -626. 1257. 1266.  1.89  28.5  22.5 0.443  8.58 0.740 0.0484
ggplot(ap_arma_1_1_1_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Air Passenger Forecast for ARIMA(1, 1, 1) Model')

This is not a good model either. We need to include a seasonal component for this model to be useful.

Before we move on, we try using the auto.arima() function to see what the best model looks like.

ap_autoarima_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    auto.arima(seasonal = FALSE)

ap_autoarima_forecast_tbl <- ap_autoarima_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

ap_autoarima_fit %>% sw_tidy()
## # A tibble: 9 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 ar1    -0.0136
## 2 ar2     0.967 
## 3 ar3    -0.122 
## 4 ar4    -0.778 
## 5 ma1    -0.0312
## 6 ma2    -1.49  
## 7 ma3     0.0300
## 8 ma4     0.863 
## 9 drift   2.66
ap_autoarima_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE    MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 ARIMA(4,1~  22.8  -594. 1208. 1237. 0.249  21.9  17.1 -0.700  6.71 0.563
## # ... with 1 more variable: ACF1 <dbl>
ggplot(ap_autoarima_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Air Passenger Forecast for Auto-ARIMA Model')

We will look at including seasonal elements in the ARIMA model shortly.

7.3.2 Fitting the Maine Unemployment Data

We will not try any particular order of ARIMA model for the Maine Unemployment data, and instead move straight to using auto.arima() to fit the models.

maine_autoarima_fit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    auto.arima(seasonal = FALSE)

maine_autoarima_forecast_tbl <- maine_autoarima_fit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

maine_autoarima_fit %>% sw_tidy()
## # A tibble: 1 x 2
##   term  estimate
##   <lgl> <lgl>   
## 1 NA    NA
maine_autoarima_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC      ME  RMSE   MAE    MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 ARIMA(0,1~ 0.440  -63.9  130.  133. -0.0194 0.438 0.329 -0.811  7.52 0.625
## # ... with 1 more variable: ACF1 <dbl>
ggplot(maine_autoarima_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Maine Unemployment Forecast for Auto-ARIMA Model')

7.4 Seasonal ARIMA Models

We add seasonality to an ARIMA model by including separate autoregressive, moving-average and differencing terms for the seasonal component of the time series.

Full detail of this approach is beyond the scope of this workshop, but we specify a SARIMA model as \(\text{SARIMA}(p, d, q)(P, D, Q)_m\), where the capitals \((P, D, Q)\) are for the seasonal part of the model, with \(m\) denoting the number of data points per seasonal cycle - so we have \(m = 12\) for monthly data with annual seasonality.
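To see the notation in action, a seasonal model can also be specified explicitly rather than searched for. The base-R sketch below fits the classic "airline" specification, \(\text{SARIMA}(0,1,1)(0,1,1)_{12}\), to the AirPassengers data.

```r
# SARIMA(0,1,1)(0,1,1)[12] specified directly: one non-seasonal MA term and
# one seasonal MA term, with both first and seasonal differencing applied.
fit <- arima(AirPassengers,
             order    = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

coef(fit)   # an ma1 and an sma1 coefficient
```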

7.4.1 Fitting SARIMA Models

Fitting a seasonal ARIMA model is a matter of setting the seasonal = TRUE parameter in auto.arima().

We now fit SARIMA models for both the Air Passenger and Maine Unemployment data.

ap_sarima_fit <- ap_train_tbl %>%
    tk_ts(select = value, start = 1949, frequency = 12) %>%
    auto.arima(seasonal = TRUE)

ap_sarima_forecast_tbl <- ap_sarima_fit %>%
    forecast(h = 12, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

ap_sarima_fit %>% sw_tidy()
## # A tibble: 1 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 ar1     -0.243
ap_sarima_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE   MPE  MAPE  MASE   ACF1
##   <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA(1,1~  10.5  -448.  900.  905. 0.579  9.91  7.48 0.119  2.88 0.246 0.0123
ggplot(ap_sarima_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = ap_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Air Passenger Forecast for Seasonal ARIMA Model')

Now we fit the Maine Unemployment data.

maine_sarima_fit <- maine_train_tbl %>%
    tk_ts(select = value, start = 1996, frequency = 12) %>%
    auto.arima(seasonal = TRUE)

maine_sarima_forecast_tbl <- maine_sarima_fit %>%
    forecast(h = 20, level = c(50, 80)) %>%
    sw_sweep(rename_index = 'month')

maine_sarima_fit %>% sw_tidy()
## # A tibble: 1 x 2
##   term  estimate
##   <chr>    <dbl>
## 1 sma1    -0.856
maine_sarima_fit %>% sw_glance()
## # A tibble: 1 x 12
##   model.desc sigma logLik   AIC   BIC     ME  RMSE   MAE   MPE  MAPE  MASE
##   <chr>      <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(0,1~ 0.195   13.5 -23.0 -17.9 0.0172 0.182 0.129 0.480  3.06 0.245
## # ... with 1 more variable: ACF1 <dbl>
ggplot(maine_sarima_forecast_tbl) +
    geom_line(aes(x = month, y = value, colour = key)) +
    geom_ribbon(aes(x = month, ymin = lo.50, ymax = hi.50), alpha = 0.2) +
    geom_line  (aes(x = month, y = value), colour = 'red', data = maine_test_tbl) +
    expand_limits(y = 0) +
    xlab("Date") +
    ylab("Count") +
    ggtitle('Maine Unemployment Forecast for Seasonal ARIMA Model')

8 R Environment

devtools::session_info()
## - Session info ---------------------------------------------------------------
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Windows 10 x64              
##  system   x86_64, mingw32             
##  ui       RStudio                     
##  language (EN)                        
##  collate  English_Ireland.1252        
##  ctype    English_Ireland.1252        
##  tz       Europe/London               
##  date     2019-03-09                  
## 
## - Packages -------------------------------------------------------------------
##  package              * version  date       lib source        
##  assertthat             0.2.0    2017-04-11 [1] CRAN (R 3.5.1)
##  backports              1.1.3    2018-12-14 [1] CRAN (R 3.5.1)
##  broom                * 0.5.1    2018-12-05 [1] CRAN (R 3.5.1)
##  callr                  3.1.1    2018-12-21 [1] CRAN (R 3.5.2)
##  cellranger             1.1.0    2016-07-27 [1] CRAN (R 3.5.1)
##  cli                    1.0.1    2018-09-25 [1] CRAN (R 3.5.1)
##  colorspace             1.4-0    2019-01-13 [1] CRAN (R 3.5.2)
##  cowplot              * 0.9.4    2019-01-08 [1] CRAN (R 3.5.2)
##  cranlogs             * 2.1.0    2015-12-07 [1] CRAN (R 3.5.2)
##  crayon                 1.3.4    2017-09-16 [1] CRAN (R 3.5.1)
##  curl                   3.3      2019-01-10 [1] CRAN (R 3.5.1)
##  desc                   1.2.0    2018-05-01 [1] CRAN (R 3.5.1)
##  devtools             * 2.0.1    2018-10-26 [1] CRAN (R 3.5.1)
##  digest                 0.6.18   2018-10-10 [1] CRAN (R 3.5.1)
##  dplyr                * 0.8.0.1  2019-02-15 [1] CRAN (R 3.5.2)
##  evaluate               0.13     2019-02-12 [1] CRAN (R 3.5.2)
##  fansi                  0.4.0    2018-10-05 [1] CRAN (R 3.5.1)
##  forcats              * 0.4.0    2019-02-17 [1] CRAN (R 3.5.2)
##  forecast             * 8.5      2019-01-18 [1] CRAN (R 3.5.2)
##  fortunes               1.5-4    2016-12-29 [1] CRAN (R 3.5.0)
##  fracdiff               1.4-2    2012-12-02 [1] CRAN (R 3.5.2)
##  fs                     1.2.6    2018-08-23 [1] CRAN (R 3.5.1)
##  generics               0.0.2    2018-11-29 [1] CRAN (R 3.5.1)
##  ggplot2              * 3.1.0    2018-10-25 [1] CRAN (R 3.5.1)
##  glue                   1.3.0    2018-07-17 [1] CRAN (R 3.5.1)
##  gtable                 0.2.0    2016-02-26 [1] CRAN (R 3.5.1)
##  haven                  2.1.0    2019-02-19 [1] CRAN (R 3.5.2)
##  hms                    0.4.2    2018-03-10 [1] CRAN (R 3.5.1)
##  htmltools              0.3.6    2017-04-28 [1] CRAN (R 3.5.1)
##  httr                   1.4.0    2018-12-11 [1] CRAN (R 3.5.1)
##  jsonlite               1.6      2018-12-07 [1] CRAN (R 3.5.1)
##  knitr                  1.22     2019-03-08 [1] CRAN (R 3.5.2)
##  labeling               0.3      2014-08-23 [1] CRAN (R 3.5.0)
##  lattice                0.20-38  2018-11-04 [1] CRAN (R 3.5.2)
##  lazyeval               0.2.1    2017-10-29 [1] CRAN (R 3.5.1)
##  lmtest                 0.9-36   2018-04-04 [1] CRAN (R 3.5.2)
##  lubridate            * 1.7.4    2018-04-11 [1] CRAN (R 3.5.1)
##  magrittr               1.5      2014-11-22 [1] CRAN (R 3.5.1)
##  memoise                1.1.0    2017-04-21 [1] CRAN (R 3.5.1)
##  modelr                 0.1.4    2019-02-18 [1] CRAN (R 3.5.2)
##  munsell                0.5.0    2018-06-12 [1] CRAN (R 3.5.1)
##  nlme                   3.1-137  2018-04-07 [2] CRAN (R 3.5.2)
##  nnet                   7.3-12   2016-02-02 [2] CRAN (R 3.5.2)
##  packrat                0.5.0    2018-11-14 [1] CRAN (R 3.5.1)
##  PerformanceAnalytics * 1.5.2    2018-03-02 [1] CRAN (R 3.5.2)
##  pillar                 1.3.1    2018-12-15 [1] CRAN (R 3.5.1)
##  pkgbuild               1.0.2    2018-10-16 [1] CRAN (R 3.5.1)
##  pkgconfig              2.0.2    2018-08-16 [1] CRAN (R 3.5.1)
##  pkgload                1.0.2    2018-10-29 [1] CRAN (R 3.5.1)
##  plyr                   1.8.4    2016-06-08 [1] CRAN (R 3.5.1)
##  prettyunits            1.0.2    2015-07-13 [1] CRAN (R 3.5.1)
##  processx               3.2.1    2018-12-05 [1] CRAN (R 3.5.1)
##  ps                     1.3.0    2018-12-21 [1] CRAN (R 3.5.2)
##  purrr                * 0.3.1    2019-03-03 [1] CRAN (R 3.5.2)
##  quadprog               1.5-5    2013-04-17 [1] CRAN (R 3.5.2)
##  Quandl                 2.9.1    2018-08-14 [1] CRAN (R 3.5.2)
##  quantmod             * 0.4-13   2018-04-13 [1] CRAN (R 3.5.2)
##  R6                     2.4.0    2019-02-14 [1] CRAN (R 3.5.2)
##  Rcpp                   1.0.0    2018-11-07 [1] CRAN (R 3.5.1)
##  readr                * 1.3.1    2018-12-21 [1] CRAN (R 3.5.2)
##  readxl                 1.3.0    2019-02-15 [1] CRAN (R 3.5.2)
##  remotes                2.0.2    2018-10-30 [1] CRAN (R 3.5.1)
##  reshape2               1.4.3    2017-12-11 [1] CRAN (R 3.5.1)
##  rlang                  0.3.1    2019-01-08 [1] CRAN (R 3.5.2)
##  rmarkdown              1.11     2018-12-08 [1] CRAN (R 3.5.1)
##  rprojroot              1.3-2    2018-01-03 [1] CRAN (R 3.5.1)
##  rstudioapi             0.9.0    2019-01-09 [1] CRAN (R 3.5.1)
##  rvest                  0.3.2    2016-06-17 [1] CRAN (R 3.5.1)
##  scales               * 1.0.0    2018-08-09 [1] CRAN (R 3.5.1)
##  seasonal             * 1.7.0    2018-12-20 [1] CRAN (R 3.5.2)
##  sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 3.5.1)
##  stringi                1.3.1    2019-02-13 [1] CRAN (R 3.5.2)
##  stringr              * 1.4.0    2019-02-10 [1] CRAN (R 3.5.2)
##  sweep                * 0.2.1.1  2018-08-19 [1] CRAN (R 3.5.2)
##  testthat               2.0.1    2018-10-13 [1] CRAN (R 3.5.2)
##  tibble               * 2.0.1    2019-01-12 [1] CRAN (R 3.5.2)
##  tidyquant            * 0.5.5    2018-05-09 [1] CRAN (R 3.5.2)
##  tidyr                * 0.8.3    2019-03-01 [1] CRAN (R 3.5.2)
##  tidyselect             0.2.5    2018-10-11 [1] CRAN (R 3.5.1)
##  tidyverse            * 1.2.1    2017-11-14 [1] CRAN (R 3.5.1)
##  timeDate               3043.102 2018-02-21 [1] CRAN (R 3.5.1)
##  timetk               * 0.1.1.1  2018-08-19 [1] CRAN (R 3.5.2)
##  tseries                0.10-46  2018-11-19 [1] CRAN (R 3.5.2)
##  TTR                  * 0.23-4   2018-09-20 [1] CRAN (R 3.5.2)
##  urca                   1.3-0    2016-09-06 [1] CRAN (R 3.5.2)
##  usethis              * 1.4.0    2018-08-14 [1] CRAN (R 3.5.1)
##  utf8                   1.1.4    2018-05-24 [1] CRAN (R 3.5.1)
##  withr                  2.1.2    2018-03-15 [1] CRAN (R 3.5.1)
##  x13binary              1.1.39-1 2017-05-04 [1] CRAN (R 3.5.2)
##  xfun                   0.5      2019-02-20 [1] CRAN (R 3.5.2)
##  xml2                   1.2.0    2018-01-24 [1] CRAN (R 3.5.1)
##  xts                  * 0.11-2   2018-11-05 [1] CRAN (R 3.5.1)
##  yaml                   2.2.0    2018-07-25 [1] CRAN (R 3.5.1)
##  zoo                  * 1.8-4    2018-09-19 [1] CRAN (R 3.5.1)
## 
## [1] C:/Work/Programs/R/library
## [2] C:/Work/Programs/R/R-3.5.2/library